Commit cb4c7ca9 authored by Schinkelshoek

Update 1_preprocess_data.ipynb

parent 905c5514
@@ -6,7 +6,7 @@
 "source": [
 "# Introduction data preprocessing\n",
 "\n",
-"In this jupyter notebook we will focus on how to preprocess your data. We will apply the techniques discussed in the presentation. The codes can be copied and paste if you want to use the code for your own data.\n",
+"In this jupyter notebook we will focus on how to preprocess your data. We will apply the techniques discussed in the presentation. The codes can be copied and pasted if you want to use the code for your own data.\n",
 "\n",
 "We will work with the open data set *Breast Cancer Wisconsin (Diagnostic) Data Set*. For more information about the data set: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29"
 ]
@@ -358,7 +358,7 @@
 "source": [
 "This plot shows:\n",
 "- The dark features show where there is no correlation\n",
-"- The bright features do show a correlation. Features with high correlations contain both contain the same information\n",
+"- The bright features do show a correlation. Features with high correlations both contain the same information\n",
 "- On the diagonal you see the correlation of the feature with itself which is, naturally, always 1.0"
 ]
},
},
......
%% Cell type:markdown id: tags:
# Introduction data preprocessing
In this Jupyter notebook we will focus on how to preprocess your data. We will apply the techniques discussed in the presentation. The code can be copied and pasted if you want to use it for your own data.
We will work with the open data set *Breast Cancer Wisconsin (Diagnostic) Data Set*. For more information about the data set: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
%% Cell type:code id: tags:
``` python
# Load the python packages we will use in the following cell blocks
import pandas as pd # data processing
import numpy as np # linear algebra
import matplotlib.pyplot as plt # data visualisation
import seaborn as sns # data visualisation
```
%% Cell type:markdown id: tags:
## Data
The data is found in the file 'data_raw.csv' in the current folder. The following information is given by the source of the data:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
   a) radius (mean of distances from center to points on the perimeter)
   b) texture (standard deviation of gray-scale values)
   c) perimeter
   d) area
   e) smoothness (local variation in radius lengths)
   f) compactness (perimeter^2 / area - 1.0)
   g) concavity (severity of concave portions of the contour)
   h) concave points (number of concave portions of the contour)
   i) symmetry
   j) fractal dimension ("coastline approximation" - 1)
%% Cell type:code id: tags:
``` python
# read the csv file and show the first rows
data = pd.read_csv('data_raw.csv')
data.head()
```
%% Cell type:code id: tags:
``` python
# feature names in the csv file
col = data.columns
print(col)
```
%% Cell type:markdown id: tags:
# EDA and cleaning
First we will look into the data and see what we have. We will do this using various plots.
In the end we want to predict the column *diagnosis* for these patients, so this column will be the label.
%% Cell type:code id: tags:
``` python
# What is the distribution of the label?
sns.countplot(x='diagnosis', data=data)
print('Number of times diagnosis appears:')
# value_counts returns a third entry for the invalid -999 labels discussed below, which we ignore here
B, M, _ = data['diagnosis'].value_counts()
print('Benign: {0} ({1:.2f}%)'.format(B, B/len(data) * 100))
print('Malignant: {0} ({1:.2f}%)'.format(M, M/len(data) * 100))
```
%% Cell type:code id: tags:
``` python
# what kinds of data do we have?
data.info()
```
%% Cell type:code id: tags:
``` python
# create an overview of the data and their statistics using the 'describe' function of pandas
# this only works for numeric data
data.describe().T
```
%% Cell type:markdown id: tags:
What do you notice?
- All the data is numeric except for the label column, which is a category
- There is an empty column *Unnamed: 32*, so we will drop this column
- The rest of the columns have a minimum value of -999
## TODO
**What do you think the value -999 means?**
%% Cell type:markdown id: tags:
Your answer:
%% Cell type:markdown id: tags:
## TODO
**Check the data and look into these values. How many of these values are present per column? What should we do with the -999 values in the diagnosis column?**
**Replace the values with NaN's or delete them and show the new statistics. We will impute these values later if necessary**
%% Cell type:code id: tags:
``` python
## Your code
```
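%% Cell type:markdown id: tags:
A minimal sketch of one possible approach (assuming the placeholders appear as the number -999, or as the string '-999' in object columns): count the placeholders per column, drop the rows whose *diagnosis* label is invalid since a label cannot be imputed, and replace the remaining placeholders with NaN's.
%% Cell type:code id: tags:
``` python
# count the -999 placeholders per column (numeric or string, depending on the dtype)
placeholder_counts = data.isin([-999, '-999']).sum()
print(placeholder_counts[placeholder_counts > 0])

# rows with an invalid label are useless, because a label cannot be imputed
data = data[data['diagnosis'].isin(['M', 'B'])]

# replace the remaining placeholders with NaN's so they can be imputed later
data = data.replace(-999, np.nan)
data.describe().T
```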
%% Cell type:code id: tags:
``` python
# drop empty column
data = data.drop(labels=['Unnamed: 32'], axis=1)
# update columns
col = data.columns
plot_col = col[2:]
```
%% Cell type:markdown id: tags:
### Outliers
Check for outliers using boxplots.
%% Cell type:code id: tags:
``` python
# box plot for one feature
sns.boxplot(x=plot_col[0], data=data, orient='v')
```
%% Cell type:markdown id: tags:
## TODO
**Look into the data using boxplots to find outliers. Update the dataframe accordingly and redo the boxplots.**
%% Cell type:code id: tags:
``` python
## Your code
```
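%% Cell type:markdown id: tags:
As a starting point, here is a minimal sketch that flags values outside 1.5 times the interquartile range, which matches the whiskers of the boxplot above; whether such values should be removed is a judgement call.
%% Cell type:code id: tags:
``` python
# IQR rule: flag values more than 1.5 * IQR outside the quartiles
feature = plot_col[0]
q1, q3 = data[feature].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = data[(data[feature] < q1 - 1.5 * iqr) | (data[feature] > q3 + 1.5 * iqr)]
print('Number of outliers in {0}: {1}'.format(feature, len(outliers)))
```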
%% Cell type:markdown id: tags:
### Distributions per category
Now we will look at the distributions of the features for the two labels. This will give us an idea of which features are useful.
%% Cell type:code id: tags:
``` python
# Let's look at the first 5 features as a function of the diagnosis
plot_col = col[2:]
fig, axs = plt.subplots(5,1, figsize=(6,15))
for i in range(5):
axs[i].set_title(plot_col[i])
sns.distplot(data.loc[data['diagnosis'] == 'M', plot_col[i]], label='M', ax=axs[i], norm_hist = False, hist=False, axlabel=False)
sns.distplot(data.loc[data['diagnosis'] == 'B', plot_col[i]], label='B', ax=axs[i], norm_hist = False, hist=False, axlabel=False)
```
%% Cell type:code id: tags:
``` python
# Here we visualise all features
fig, axs = plt.subplots(10,3, figsize=(20,30))
col_ct = 0
for j in range(3):
for i in range(10):
axs[i,j].set_title(plot_col[col_ct])
sns.distplot(data.loc[data['diagnosis'] == 'M', plot_col[col_ct]], label='M', ax=axs[i,j], norm_hist = False, hist=False, axlabel=False)
sns.distplot(data.loc[data['diagnosis'] == 'B', plot_col[col_ct]], label='B', ax=axs[i,j], norm_hist = False, hist=False, axlabel=False)
col_ct+=1
plt.legend()
```
%% Cell type:markdown id: tags:
Here you can already take a look at which features separate the M and B classes nicely.
## TODO
**Which ones do you think we will be able to use in the model?**
%% Cell type:markdown id: tags:
Your answer:
%% Cell type:markdown id: tags:
### Correlations
We can look into the correlations between features by plotting scatter plots. As an example, let's take the features *radius_mean* and *perimeter_worst*.
%% Cell type:code id: tags:
``` python
# scatter plot
fig,ax = plt.subplots(1,2, figsize=(16,6))
sns.scatterplot(x='radius_mean', y='perimeter_worst', data=data, ax=ax[0])
# scatter plot with label
sns.scatterplot(x='radius_mean', y='perimeter_worst', data=data, hue='diagnosis', ax=ax[1])
```
%% Cell type:markdown id: tags:
This plot shows a correlation between the two features, which means they are not independent. The right panel shows the same plot coloured by the label.
%% Cell type:code id: tags:
``` python
# pair plot for the first 5 features
sns.pairplot(data, hue='diagnosis', vars=plot_col[:5])
```
%% Cell type:markdown id: tags:
## TODO
**Now try to make multiple pairplots yourself**
%% Cell type:code id: tags:
``` python
#Your code
```
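%% Cell type:markdown id: tags:
A minimal sketch for one more block of features; slicing *plot_col* differently lets you inspect any combination.
%% Cell type:code id: tags:
``` python
# pair plot for the next block of five features
sns.pairplot(data, hue='diagnosis', vars=plot_col[5:10])
```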
%% Cell type:markdown id: tags:
A heatmap of the correlations can give insight into the relations between features. Using the corr() function from pandas we can easily calculate the correlations.
However, note that corr() by default only measures linear (Pearson) correlations!
%% Cell type:code id: tags:
``` python
# plot heatmap
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap(data[plot_col].corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
```
%% Cell type:markdown id: tags:
This plot shows:
- Dark cells indicate feature pairs with little or no correlation
- Bright cells indicate a strong correlation. Features with a high mutual correlation largely contain the same information
- On the diagonal you see the correlation of each feature with itself which is, naturally, always 1.0
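%% Cell type:markdown id: tags:
The corr() function also accepts other methods; here is a short sketch with the Spearman rank correlation, which also picks up monotonic but non-linear relations:
%% Cell type:code id: tags:
``` python
# Spearman rank correlation captures monotonic relations, linear or not
f, ax = plt.subplots(figsize=(18, 18))
sns.heatmap(data[plot_col].corr(method='spearman'), annot=True, linewidths=.5, fmt='.1f', ax=ax)
```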
%% Cell type:markdown id: tags:
## Impute NaN's
Here we will try to impute the NaN's in the data set. One of the columns with a lot of NaN's is *texture_mean*. We will take a look at the possibilities with this column.
%% Cell type:code id: tags:
``` python
# replace the -999 values if not done yet
data = data.replace(-999, np.nan)
# drop rows that consist entirely of NaN's
data = data.dropna(how='all', axis=0)
print("Number of NaN's: ", data['texture_mean'].isnull().sum())
```
%% Cell type:code id: tags:
``` python
# check how many rows consist entirely of NaN's
data.isnull().all(axis=1).sum()
```
%% Cell type:code id: tags:
``` python
## impute values
from sklearn.impute import SimpleImputer
impMean = SimpleImputer(missing_values=np.nan, strategy='mean')
impFreq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
## try out the mean
data['texture_mean_impute_mean'] = impMean.fit_transform(data['texture_mean'].values.reshape(-1,1))
print('Imputed mean: ', impMean.statistics_)
## try out the most frequent value
data['texture_mean_impute_freq'] = impFreq.fit_transform(data['texture_mean'].values.reshape(-1,1))
print('Imputed freq: ', impFreq.statistics_)
## plot the distributions
fig, ax = plt.subplots(1, 2, figsize=(16,5))
sns.distplot(data['texture_mean'], bins=20, kde=False,label='Original', ax=ax[0])
sns.distplot(data['texture_mean_impute_mean'], bins=20, kde=False, label='Imputed Mean', ax=ax[0])
sns.distplot(data['texture_mean'], bins=20, kde=False,label='Original', ax=ax[1])
sns.distplot(data['texture_mean_impute_freq'], bins=20, kde=False, label='Imputed Freq', ax=ax[1])
ax[0].axvline(impMean.statistics_[0], label='Imputed value')
ax[1].axvline(impFreq.statistics_[0], label='Imputed value')
ax[0].legend()
ax[1].legend()
```
%% Cell type:markdown id: tags:
As you can see, the imputed values significantly change the distribution of the data. Let's try to impute values using the MICE method instead.
%% Cell type:code id: tags:
``` python
# IterativeImputer (a MICE-style imputer) is experimental, so it has to be enabled explicitly
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# impute the nans
imp = IterativeImputer(random_state=0)
data_imp = pd.DataFrame(imp.fit_transform(data[plot_col]), columns=plot_col)
# plot the distribution
fig, ax = plt.subplots(1, figsize=(8,6))
sns.distplot(data['texture_mean'], bins=20, kde=False,label='Original', ax=ax)
sns.distplot(data_imp['texture_mean'], bins=20, kde=False, label='Imputed', ax=ax)
plt.legend()
```
%% Cell type:markdown id: tags:
## TODO
**Look into the other features and decide whether you want to use this imputation method for the rest of your features**
%% Cell type:code id: tags:
``` python
# Your code
```
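%% Cell type:markdown id: tags:
A minimal sketch for making that decision: count the NaN's per feature, then overlay the original and imputed distributions for a feature of your choice (the *data_imp* frame comes from the MICE cell above).
%% Cell type:code id: tags:
``` python
# number of NaN's per feature, highest first
nan_counts = data[plot_col].isnull().sum().sort_values(ascending=False)
print(nan_counts[nan_counts > 0])

# compare the original and imputed distributions for one feature
feature = 'radius_mean'
fig, ax = plt.subplots(1, figsize=(8, 6))
sns.distplot(data[feature], bins=20, kde=False, label='Original', ax=ax)
sns.distplot(data_imp[feature], bins=20, kde=False, label='Imputed', ax=ax)
plt.legend()
```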
%% Cell type:markdown id: tags:
## TODO
Save your data as a CSV file
%% Cell type:code id: tags:
``` python
# save data frame
data.to_csv('data_clean.csv', index=False)
```
......