In this Jupyter notebook we will focus on how to explore your data and how to perform feature engineering and selection. We will apply the techniques discussed in the presentation. You can copy and paste the code if you want to use it for your own data.
We will work with the open data set *Heart Disease Data Set*. For more information about the data set: https://archive.ics.uci.edu/ml/datasets/heart+Disease+%28Diagnostic%29
%% Cell type:code id: tags:
``` python
# Load the python packages we will use in the following cell blocks
import pandas as pd # data processing
import numpy as np # linear algebra
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
```
%% Cell type:markdown id: tags:
## Data
The data is found in the file 'data_raw.csv' in the current folder.
### Task 1
Load the data from the csv into a pandas DataFrame and print the first ten rows of the data.
What do you notice based on the descriptive statistics and heatmap?
%% Cell type:raw id: tags:
Your answer:
%% Cell type:markdown id: tags:
### Task 6
What do you think the value -999 means?
%% Cell type:raw id: tags:
Your answer:
%% Cell type:markdown id: tags:
----------------------------------------------
## Handling Missing values
### Task 7
How many of these values (-999) are present per column? What to do with the -999 values in the target column?
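As a hint, here is a minimal sketch of how a sentinel value could be counted per column. The `toy` DataFrame and its values are made up for illustration and are not taken from 'data_raw.csv'; the same comparison should carry over to your own DataFrame.

```python
import pandas as pd

# Hypothetical toy data, only for illustration (not the real data set)
toy = pd.DataFrame({
    "age": [50, -999, 61],
    "chol": [244.0, 210.0, -999.0],
    "condition": [0, 1, -999],
})

# Comparing with the sentinel gives a boolean DataFrame;
# summing it counts the True entries in each column
sentinel_counts = (toy == -999).sum()
print(sentinel_counts)
```

The result is a pandas Series indexed by column name, so columns without any -999 show a count of zero.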
%% Cell type:code id: tags:
``` python
## Your code
# number of values present per column
```
%% Cell type:code id: tags:
``` python
#your code:
# what to do with the -999 in the target column?
```
%% Cell type:markdown id: tags:
### Task 8
Replace the remaining -999 values with NaNs or delete them, and show the new statistics. To replace the values you can use the `.mask()` method. We will impute these values later if necessary.
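A minimal sketch of the `.mask()` approach, again on a made-up `toy` DataFrame (the column values are hypothetical). `mask` replaces every entry where the condition is True, and its default replacement value is NaN.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "trestbps": [120, -999, 140],
    "chol": [244.0, 210.0, -999.0],
})

# Entries equal to the sentinel become NaN; all other entries are kept
cleaned = toy.mask(toy == -999)

# describe() skips NaN values, so the sentinel no longer distorts the statistics
print(cleaned.describe())
```

Note that an integer column that receives NaN values is converted to a float column.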
What do you think of the descriptive statistics? Do you recognize values that stand out? Note that you will need to know the normal range of all values to discover outliers. This is where domain knowledge comes in:
%% Cell type:raw id: tags:
Your answer:
%% Cell type:markdown id: tags:
----------------------------------------------
%% Cell type:markdown id: tags:
### Outliers
Unfortunately, continuous features are not always normally distributed. Check for outliers using boxplots. Below, we first show an example of how to visualize and deal with the outliers in the 'thalach' column:
%% Cell type:code id: tags:
``` python
# box plot for one feature
sns.boxplot(x='thalach', data=data)
```
%%%% Output: execute_result
<matplotlib.axes._subplots.AxesSubplot at 0x28ee7f95460>
age sex cp trestbps chol fbs restecg thalach exang oldpeak \
49 50 0 1 120 244.0 0 0 -378.041696 0.0 1.1
slope ca thal condition
49 0 0.0 0 0
%% Cell type:markdown id: tags:
A heart rate cannot be negative, so we will replace this value with a realistic one. Only one entry has a negative value, and the condition for that entry is zero. We will impute its 'thalach' value with the mean of all 'thalach' values that are positive and also have condition = 0 (no disease).
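The imputation described above can be sketched as follows. This is an illustration on a made-up `toy` DataFrame, not the notebook's actual cell; only the column names 'thalach' and 'condition' come from the data set.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "thalach": [150.0, -378.0, 170.0, 120.0],
    "condition": [0, 0, 0, 1],
})

# Reference group: positive heart rates among entries with condition == 0
reference = toy.loc[(toy["thalach"] > 0) & (toy["condition"] == 0), "thalach"]
fill_value = reference.mean()

# Overwrite only the negative entries with the reference mean
toy.loc[toy["thalach"] < 0, "thalach"] = fill_value
print(toy)
```

Computing the mean within the same condition group keeps the imputed value consistent with the other entries that share the same label.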