%% Cell type:markdown id: tags:

# Data preprocessing

In this Jupyter notebook we will focus on how to explore your data and how to perform feature engineering and selection. We will apply the techniques discussed in the presentation. The code can be copied and pasted if you want to use it for your own data.

We will work with the open data set *Heart Disease Data Set*. For more information about the data set: https://archive.ics.uci.edu/ml/datasets/heart+Disease+%28Diagnostic%29

%% Cell type:code id: tags:

``` python
# Load the Python packages we will use in the following cells
import pandas as pd  # data processing
import numpy as np  # linear algebra
import matplotlib.pyplot as plt  # data visualization
import seaborn as sns  # data visualization
```

%% Cell type:markdown id: tags:

## Data

The data is found in the file 'data_raw.csv' in the current folder.

### Task 1

Load the data from the csv into a pandas DataFrame and print the first ten rows of the data.

%% Cell type:code id: tags:

``` python
# Your code:
# read csv file and print header
```

%% Cell type:code id: tags:

``` python
# feature names in the csv file
col = data.columns
print(col)
```

%%%% Output: stream

    Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
           'exang', 'oldpeak', 'slope', 'ca', 'thal', 'condition'],
          dtype='object')

%% Cell type:markdown id: tags:

The following information is given by the source of the data:

1) age: age in years
2) sex: sex (1 = male; 0 = female)
3) cp: chest pain type
    -- Value 0: typical angina
    -- Value 1: atypical angina
    -- Value 2: non-anginal pain
    -- Value 3: asymptomatic
4) trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5) chol: serum cholesterol in mg/dl
6) fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7) restecg: resting electrocardiographic results
    -- Value 0: normal
    -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05
mV)
    -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8) thalach: maximum heart rate achieved
9) exang: exercise induced angina (1 = yes; 0 = no)
10) oldpeak: ST depression induced by exercise relative to rest
11) slope: the slope of the peak exercise ST segment
    -- Value 0: upsloping
    -- Value 1: flat
    -- Value 2: downsloping
12) ca: number of major vessels (0-3) colored by fluoroscopy
13) thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
14) condition: the target feature. 0 = no disease, 1 = disease

%% Cell type:markdown id: tags:

# Exploratory Data Analysis

First we will look into the data and see what we have. We will do this by using various plots.
In the end we will want to predict the column *condition* for these patients, so this column will be the label.

### Task 2

Plot the distribution of the label. You can use the seaborn function `.countplot()` for this.

%% Cell type:code id: tags:

``` python
# Your code:
```

%% Cell type:markdown id: tags:

Next, we want to get more insight into the other features we have:

### Task 3

Find out what kind of data we have per column. Do the results make sense?

%% Cell type:code id: tags:

``` python
# Your code:
```

%% Cell type:raw id: tags:

What do you think of the data types?
Can you see what processing steps have been performed already?

Your answer:

Your answer:

%% Cell type:markdown id: tags:

### Task 4

Create an overview of the data and their statistics using the `describe()` function of pandas.

%% Cell type:code id: tags:

``` python
# Your code:
```

%% Cell type:code id: tags:

``` python
# This code visualises where the value -999 is present in the table
sns.heatmap(data.where(data==-999, 0).astype(bool), cbar=False)
```

%%%% Output: execute_result

%%%% Output: display_data

![]()

%% Cell type:markdown id: tags:

### Task 5

What do you notice based on the descriptive statistics and heatmap?
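As an aside on the heatmap cell above, the following is a minimal sketch of what the expression passed to `sns.heatmap` computes. It uses a small hypothetical table (`toy` and its columns `a` and `b` are made up for illustration and are not part of 'data_raw.csv').

```python
import pandas as pd

# Hypothetical toy table, NOT taken from data_raw.csv
toy = pd.DataFrame({"a": [1, -999, 3], "b": [-999.0, 2.5, 4.0]})

# Same expression as in the heatmap cell:
# keep -999 where it occurs, set every other cell to 0,
# then cast to bool so that only the -999 cells become True
mask = toy.where(toy == -999, 0).astype(bool)

# A direct comparison produces the same boolean table
same = mask.equals(toy == -999)

print(mask)
print(same)
```

In the heatmap, every `True` cell of such a table is drawn in a contrasting color, so each mark corresponds to one -999 entry.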
%% Cell type:raw id: tags:

Your answer:

%% Cell type:markdown id: tags:

### Task 6

What do you think the value -999 means?

%% Cell type:raw id: tags:

Your answer:

%% Cell type:markdown id: tags:

----------------------------------------------

## Handling missing values

### Task 7

How many of these values (-999) are present per column? What should be done with the -999 values in the target column?

%% Cell type:code id: tags:

``` python
# Your code:
# number of -999 values present per column
```

%% Cell type:code id: tags:

``` python
# Your code:
# what to do with the -999 in the target column?
```

%% Cell type:markdown id: tags:

### Task 8

Replace the remaining -999 values with NaNs or delete them, and show the new statistics. To replace the values you can use the `.mask()` method. We will later impute these values if necessary.

%% Cell type:code id: tags:

``` python
# Your code:
```

%% Cell type:code id: tags:

``` python
# show the new statistics
data.describe().T
```

%% Cell type:code id: tags:

``` python
# update columns
col = data.columns
print(col)
```

%%%% Output: stream

    Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
           'exang', 'oldpeak', 'slope', 'ca', 'thal', 'condition'],
          dtype='object')

%% Cell type:markdown id: tags:

### Task 9

What do you think of the descriptive statistics? Do you recognize values that stand out? Note that you will need to know the normal range of all values to discover outliers. This is where domain knowledge comes in:

%% Cell type:raw id: tags:

Your answer:

%% Cell type:markdown id: tags:

----------------------------------------------

%% Cell type:markdown id: tags:

### Outliers

Unfortunately, not all continuous features follow a normal distribution. Check for outliers using boxplots.
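For background, a boxplot marks points as outliers with a simple rule: with seaborn's default settings, the whiskers reach at most 1.5 times the interquartile range (IQR) beyond the quartiles, and anything further out is drawn as an individual point. The sketch below applies that rule by hand to hypothetical values (the series `toy` and the helper `iqr_outliers` are illustrative names and are not part of the notebook's data).

```python
import pandas as pd


def iqr_outliers(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the entries lying more than `whis` * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical toy values, NOT taken from data_raw.csv
toy = pd.Series([120, 130, 125, 140, 135, 128, 400, -50], name="toy_feature")

outliers = iqr_outliers(toy)
print(outliers)
```

A point flagged this way is only statistically unusual. Whether it is physically impossible still has to be judged with domain knowledge.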
Below, we first show an example of how to visualize and deal with the outliers in the 'thalach' column:

%% Cell type:code id: tags:

``` python
# box plot for one feature
sns.boxplot(x='thalach', data=data)
```

%%%% Output: execute_result

%%%% Output: display_data

![]()

%% Cell type:code id: tags:

``` python
# view the outlier values in the 'thalach' column:
data.loc[(data.thalach < 0), 'thalach']
```

%%%% Output: execute_result

    49   -378.041696
    Name: thalach, dtype: float64

%% Cell type:code id: tags:

``` python
# view the entire row:
data.loc[(data.thalach < 0)]
```

%%%% Output: execute_result

        age  sex  cp  trestbps   chol  fbs  restecg     thalach  exang  oldpeak  \
    49   50    0   1       120  244.0    0        0 -378.041696    0.0      1.1

        slope   ca  thal  condition
    49      0  0.0     0          0

%% Cell type:markdown id: tags:

A heart rate cannot be negative. Therefore, we will replace this value with a realistic one. There is only one entry where the value is negative, and the condition for this entry is zero. We will impute the 'thalach' value with the mean of all 'thalach' values that are positive and also have condition = 0 (no disease).

%% Cell type:code id: tags:

``` python
## impute the outliers with the mean:
# compute the mean based on the condition:
thalach_0 = data.loc[(data.thalach > 0) & (data.condition == 0), 'thalach'].mean()
print('imputed values thalach:', thalach_0)
# impute values
data.loc[(data.thalach < 0) & (data.condition == 0), 'thalach'] = thalach_0
# print imputed data
data.loc[[49]]
```

%%%% Output: stream

    imputed values thalach: 158.61935483870968

%%%% Output: execute_result

        age  sex  cp  trestbps   chol  fbs  restecg     thalach  exang  oldpeak  \
    49   50    0   1       120  244.0    0        0  158.619355    0.0      1.1

        slope   ca  thal  condition
    49      0  0.0     0          0

%% Cell type:markdown id: tags:

-----------------------------------

### Task 9

Look into the data using boxplots to find outliers for the other continuous features. Which of these values are physically not possible?
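As a hint for handling such values, the conditional-mean imputation shown for 'thalach' earlier in this section can also be written so that it covers every class at once. The following is only a sketch on hypothetical data (the frame `toy` and its column `rate` are made up for illustration), and it assumes that negative values are the only impossible ones.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from data_raw.csv
toy = pd.DataFrame({
    "rate": [150.0, -300.0, 170.0, 120.0, 140.0],
    "condition": [0, 0, 0, 1, 1],
})

# Treat physically impossible (here: negative) values as missing
toy["rate"] = toy["rate"].mask(toy["rate"] < 0)

# Mean of the remaining valid values per class, aligned to the original rows
group_mean = toy.groupby("condition")["rate"].transform("mean")

# Fill each missing value with the mean of its own class
toy["rate"] = toy["rate"].fillna(group_mean)
print(toy)
```

Whether the class mean is a sensible replacement, or whether dropping the row is the better choice, depends on the feature and remains a judgment call.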
%% Cell type:code id: tags:

``` python
# Your code:
```

%% Cell type:raw id: tags:

What do you think of the outliers? You have already looked up the possible ranges. Which values are realistic and which are not?

Your answer:

%% Cell type:markdown id: tags:

### Task 10

What do you want to do with the unrealistic values you have found? Update the dataframe accordingly:

%% Cell type:code id: tags:

``` python
# Your code:
```

%% Cell type:markdown id: tags:

--------------------------------

### Distributions per category

Now we will look into the feature distributions for the two labels. This will give us an idea of which features are useful. Notice that categorical features are shown in a bar plot, while continuous features are shown as a histogram.

%% Cell type:code id: tags:

``` python
# Let's look at the continuous features as a function of the diagnosis
# Note: distplot is deprecated in recent seaborn versions; histplot is the axes-level replacement
plot_col = ['age','trestbps','chol','thalach','oldpeak']

fig, axs = plt.subplots(5,1, figsize=(6,15))
for i in range(5):
    axs[i].set_title(plot_col[i])
    sns.distplot(data.loc[data['condition'] == 0, plot_col[i]].dropna(), label='no disease', ax=axs[i], hist=True, axlabel=False, kde=False)
    sns.distplot(data.loc[data['condition'] == 1, plot_col[i]].dropna(), label='disease', ax=axs[i], hist=True, axlabel=False, kde=False)
    axs[i].legend()
```

%%%% Output: stream

    C:\Users\flspijkerboer\Anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
      warnings.warn(msg, FutureWarning)

%%%% Output: display_data

![]()

%% Cell type:code id: tags:

``` python
# count plots for the categorical features, split by condition
# (only the first five columns of plot_col are plotted, one per subplot)
plot_col = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal', 'condition']

fig, axs = plt.subplots(5,1, figsize=(6,15))
for i in range(5):
    sns.countplot(data=data.dropna(), hue='condition', x=plot_col[i], ax=axs[i], alpha=0.5)
    axs[i].legend(loc='upper left')
```

%%%% Output: display_data

![]()

%% Cell type:markdown id: tags:

Here you can already take a look at which features separate the condition classes nicely.

----------------------------------------------

### Task 11

**Which features do you think we will be able to use in the model?**

%% Cell type:raw id: tags:

Your answer:

%% Cell type:markdown id: tags:

-----------------------------------------------

### Correlations

We can look into the correlations between features by plotting scatter plots. Let's take, for example, the features *trestbps* and *chol*.

%% Cell type:code id: tags:

``` python
# scatter plot
fig,ax = plt.subplots(1,2, figsize=(16,6))
sns.scatterplot(x='chol', y='trestbps', data=data, ax=ax[0])
# scatter plot with label
sns.scatterplot(x='chol', y='trestbps', data=data, hue='condition', ax=ax[1])
```

%%%% Output: execute_result

%%%% Output: display_data

![]()

%% Cell type:markdown id: tags:

These plots show that there is no strong correlation between the two features. The right-hand plot additionally colors the points by the label.

%% Cell type:markdown id: tags:

For categorical data it is more useful to look into categorical plots (catplot) like the following:

%% Cell type:code id: tags:

``` python
sns.catplot(x='sex', hue='condition', data=data, y='age', kind='bar')
```

%%%% Output: execute_result

%%%% Output: display_data

![]()

%% Cell type:code id: tags:

``` python
# pair plot for the first 5 features
sns.pairplot(data, hue='condition', vars=data.columns[:5])
```

%%%% Output: execute_result

%%%% Output: display_data

![](