"In this jupyter notebook we will focus on how to preprocess your data. We will apply the techniques discussed in the presentation. The codes can be copied and pasted if you want to use the code for your own data.\n",
"\n",
"We will work with the open data set *Breast Cancer Wisconsin (Diagnostic) Data Set*. For more information about the data set: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29"
]
...
...
@@ -358,7 +358,7 @@
"source": [
"This plot shows:\n",
"- The dark features show where there is no correlation\n",
"- The bright features do show a correlation. Features with high correlations both contain the same information\n",
"- On the diagonal you see the correlation of the feature with itself which is, naturally, always 1.0"
]
},
...
...
%% Cell type:markdown id: tags:
# Introduction to data preprocessing
In this Jupyter notebook we will focus on how to preprocess your data. We will apply the techniques discussed in the presentation. The code can be copied and pasted if you want to use it for your own data.
We will work with the open data set *Breast Cancer Wisconsin (Diagnostic) Data Set*. For more information about the data set: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
%% Cell type:code id: tags:
``` python
# Load the python packages we will use in the following cell blocks
import pandas as pd  # data processing
import numpy as np  # linear algebra
import matplotlib.pyplot as plt  # data visualization
import seaborn as sns  # data visualization
```
%% Cell type:markdown id: tags:
## Data
The data is found in the file 'data.csv' in the current folder. The following information is given by the source of the data:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
%% Cell type:code id: tags:
``` python
# create an overview of the data and their statistics using the 'describe' function of pandas
# this only works for numeric data
data.describe().T
```
%% Cell type:markdown id: tags:
What do you notice?
- All the data is numeric except for the label column, which is categorical
- There is an empty column *Unnamed: 32*, so we will drop this column
- The rest of the columns have a minimum value of -999
## TODO
**What do you think the value -999 means?**
%% Cell type:markdown id: tags:
Your answer:
%% Cell type:markdown id: tags:
## TODO
**Check the data and look into these values. How many of them are present per column? What should be done with the -999 values in the diagnosis column?**
**Replace the values with NaNs or delete them and show the new statistics. We will impute these values later if necessary.**
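As a hint, the counting and replacing steps can be sketched on a small made-up frame. The `toy` DataFrame and its values are invented for illustration and are not taken from `data.csv`; dropping rows without a label is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; -999 stands in for a missing measurement.
toy = pd.DataFrame({
    "diagnosis": ["M", "B", -999, "B"],
    "radius_mean": [17.9, -999, 12.4, -999],
    "texture_mean": [10.4, 17.8, 21.3, 14.1],
})

# Count how often the sentinel value occurs in each column.
sentinel_counts = (toy == -999).sum()
print(sentinel_counts)

# Replace the sentinel with NaN so that pandas treats it as missing.
cleaned = toy.replace(-999, np.nan)

# A row without a label cannot be used for supervised learning,
# so one option (an assumption here) is to drop such rows.
cleaned = cleaned.dropna(subset=["diagnosis"])

# The statistics now ignore the missing entries.
print(cleaned.describe().T)
```

After the replacement, `describe()` no longer reports -999 as a minimum, and the `count` row shows how many real values remain per column.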
%% Cell type:code id: tags:
``` python
## Your code
```
%% Cell type:code id: tags:
``` python
# drop empty column
data = data.drop(labels=['Unnamed: 32'], axis=1)
# update columns
col = data.columns
plot_col = col[2:]
```
%% Cell type:markdown id: tags:
### Outliers
Check for outliers using boxplots.
%% Cell type:code id: tags:
``` python
# box plot for one feature
sns.boxplot(x=plot_col[0], data=data, orient='v')
```
%% Cell type:markdown id: tags:
## TODO
**Look into the data using boxplots to find outliers. Update the dataframe accordingly and redo the boxplots.**
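One common way to turn what a boxplot shows into a filter is the 1.5 x IQR rule, which matches seaborn's default whisker length. The sketch below applies it to invented numbers; the helper name `iqr_outlier_mask` and the factor 1.5 are assumptions for illustration, not the prescribed solution.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, factor: float = 1.5) -> pd.Series:
    """Return True for entries outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - factor * iqr) | (values > q3 + factor * iqr)


# Hypothetical measurements with one obviously extreme value.
toy = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 100.0], name="radius_mean")

mask = iqr_outlier_mask(toy)
print(toy[mask])      # the flagged entries

# Keep only the entries that were not flagged.
filtered = toy[~mask]
```

Whether a flagged point is a measurement error or a genuine extreme case is a judgment call, so inspect the flagged rows before removing them.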
%% Cell type:code id: tags:
``` python
## Your code
```
%% Cell type:markdown id: tags:
### Distributions per category
Now we will look into the distributions of the two labels. This will give us an idea of which features are useful.
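A numeric complement to the plots is to compare summary statistics per label. This is a minimal sketch on invented values; the `toy` frame is hypothetical, and with the real data you would group the notebook's `data` frame by its diagnosis column instead.

```python
import pandas as pd

# Hypothetical toy frame with two labels and two features.
toy = pd.DataFrame({
    "diagnosis": ["M", "M", "B", "B"],
    "radius_mean": [18.0, 20.0, 12.0, 14.0],
    "smoothness_mean": [0.10, 0.12, 0.09, 0.13],
})

# Mean of every numeric feature, computed separately per label.
group_means = toy.groupby("diagnosis").mean()
print(group_means)
```

In this toy example `radius_mean` differs clearly between the two groups, while `smoothness_mean` has practically the same mean in both, which suggests the first feature separates the labels better.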
%% Cell type:code id: tags:
``` python
# Let's look at the first 5 features as a function of the diagnosis
```
%% Cell type:markdown id: tags:
A heatmap of the correlations can give insight into the relations between features. Using the `corr()` function from pandas we can easily calculate the correlations.
Note, however, that by default `corr()` computes the Pearson coefficient, which only measures linear relations!
- The dark features show where there is no correlation
- The bright features do show a correlation. Highly correlated features contain largely the same information
- On the diagonal you see the correlation of the feature with itself which is, naturally, always 1.0
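The restriction to linear relations can be checked on a made-up example. In the sketch below (all column names and values are invented for illustration), one column depends linearly on `x` and another depends on `x` through a square.

```python
import numpy as np
import pandas as pd

# Values symmetric around zero.
x = np.linspace(-1.0, 1.0, 21)

toy = pd.DataFrame({
    "x": x,
    "linear": 2.0 * x + 1.0,   # perfectly linear in x
    "quadratic": x ** 2,       # fully determined by x, but not linearly
})

# Pearson correlation matrix (the pandas default).
corr = toy.corr()
print(corr.round(2))
```

The correlation between `x` and `linear` is 1, while the correlation between `x` and `quadratic` is practically 0 even though `quadratic` is computed directly from `x`. A dark cell in the heatmap therefore rules out a linear relation only.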
%% Cell type:markdown id: tags:
## Impute NaN's
Here we will try to impute the NaN's in the data set. One of the columns with a lot of NaN's is the column *texture_mean*. We will look at the options for this column.
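Two simple options are sketched below on invented values: filling with the overall median, and filling with the median of the row's own diagnosis group. The `toy` frame is hypothetical, and the choice of the median is an assumption for illustration; other strategies may suit the real data better.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing texture values.
toy = pd.DataFrame({
    "diagnosis": ["M", "M", "B", "B", "B"],
    "texture_mean": [22.0, np.nan, 15.0, 17.0, np.nan],
})

# Option 1: fill every gap with the overall median of the column.
overall_median = toy["texture_mean"].median()
median_filled = toy["texture_mean"].fillna(overall_median)

# Option 2: fill each gap with the median of its own diagnosis group.
group_median = toy.groupby("diagnosis")["texture_mean"].transform("median")
group_filled = toy["texture_mean"].fillna(group_median)

print(pd.DataFrame({"overall": median_filled, "per_group": group_filled}))
```

The second option uses the label, so it can only be applied to rows where the diagnosis is known.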
%% Cell type:code id: tags:
``` python
# replace the -999 values if not done yet
data = data.replace(-999, np.nan)
data = data.dropna(how='all', axis=0)
print("Number of NaN's: ", data['texture_mean'].isnull().sum())