Commit 834ba170 authored by Stoop

Delete 2_feature_engineering.ipynb

parent b2d47457
%% Cell type:markdown id: tags:
# Feature Engineering
In this practicum we will iteratively introduce new features to the model and decide which features we will ultimately use.
For this exercise we will use a RandomForest, a tree-based model, and a KNN model, a non-tree-based model.
%% Cell type:code id: tags:
```
import pandas as pd # data processing
import numpy as np # linear algebra
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
import sklearn # machine learning
from sklearn.model_selection import train_test_split # split data
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold # k fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
# set random states
np.random.seed(42)
random_state = 42
```
%% Cell type:code id: tags:
```
def splits_data():
    # read in the data
    data = pd.read_csv('data_clean.csv')
    # split data into X and y
    X = data.drop(labels=['condition'], axis=1)
    y = data['condition'].values
    # Split the data into train and test using a stratified split
    # We will use the test set to determine the overall performance of the model
    X, X_test, y, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state, shuffle=True, stratify=y)
    return X, X_test, y, y_test

X, X_test, y, y_test = splits_data()
```
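%% Cell type:markdown id: tags:
As a quick sanity check (not part of the original exercise), the effect of `stratify` can be verified on synthetic labels: the class proportions of the full data set should be preserved in both the train and the test split.
%% Cell type:code id: tags:
```
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic, imbalanced labels: 80 samples of class 0, 20 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, shuffle=True, stratify=y_demo)

# stratify=y_demo keeps the 80/20 class ratio in both splits
print(np.bincount(y_tr))  # [64 16]
print(np.bincount(y_te))  # [16  4]
```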
%% Cell type:code id: tags:
```
# TODO: the assignment below is not an assignment
```
%% Cell type:markdown id: tags:
## 1. TODO
For this practicum we will use the *KNeighborsClassifier* and *RandomForestClassifier*. In the following cells we created a function *calculate_performance*. This function will apply the grid search for you.
What you have to do is define the parameters you want to tune. You can find all the parameters here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
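%% Cell type:markdown id: tags:
A possible starting point for such grids is sketched below; the exact values are up to you, and the dictionary keys must match the parameter names from the scikit-learn documentation linked above.
%% Cell type:code id: tags:
```
# hypothetical starting grids; tune the values yourself
knn_params = {
    'n_neighbors': [3, 5, 7, 9],         # number of neighbours considered
    'weights': ['uniform', 'distance'],  # vote weighting scheme
}
rf_params = {
    'n_estimators': [50, 100, 200],      # number of trees in the forest
    'max_depth': [None, 5, 10],          # maximum depth of each tree
}
print(sorted(knn_params), sorted(rf_params))
```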
%% Cell type:code id: tags:
```
## Define the models and type of cross validation
knn = KNeighborsClassifier()
rf = RandomForestClassifier()
## Define the following
scoring = 'accuracy'

def calculate_performance(X, y, X_test, y_test, model, param_grid):
    # apply grid search with stratified k-fold cross validation
    clf = GridSearchCV(model, param_grid, scoring=scoring, cv=StratifiedKFold(n_splits=5))
    # fit the grid search (refits the best model on all of X)
    clf.fit(X, y)
    # score the best model on the held-out test set
    score_test = clf.score(X_test, y_test)
    print('Performance test set: {0:.2f}\n'.format(score_test))
    print('Best params: ', clf.best_params_)
    return clf
```
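%% Cell type:markdown id: tags:
As a self-contained illustration of what the grid search does (on synthetic data, since *data_clean.csv* is not part of this snippet), the following sketch fits a *GridSearchCV* directly and inspects *best_params_*:
%% Cell type:code id: tags:
```
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the real data set
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

param_grid = {'n_neighbors': [3, 5, 7]}  # example grid; extend as needed
clf = GridSearchCV(KNeighborsClassifier(), param_grid,
                   scoring='accuracy', cv=StratifiedKFold(n_splits=5))
clf.fit(X_demo, y_demo)

# best_params_ holds the winning combination from the grid
print(clf.best_params_)
```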
%% Cell type:markdown id: tags:
## Create categories
Let's use the age feature to practice converting continuous values into categories. First we will create the categories using the function *category_age* as found below.
For the model to be able to read these, we need to encode the categories, i.e. transform them into integers.
The categories are made in the function *category_age* and transformed in *apply_encoding*.
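%% Cell type:markdown id: tags:
On a toy frame (a made-up example, not the real data) the two steps look like this. Note that *pd.cut* uses right-inclusive bins, so an age of exactly 40 still lands in the '<40' bucket, and *OrdinalEncoder* sorts the string labels, so the integer codes need not follow the age order.
%% Cell type:code id: tags:
```
import pandas as pd
from sklearn import preprocessing

toy = pd.DataFrame({'age': [35, 40, 41, 55, 72]})
# bins are right-inclusive: (0, 40], (40, 50], (50, 60], (60, 70], (70, 99]
toy['age_cat'] = pd.cut(toy['age'], bins=[0, 40, 50, 60, 70, 99],
                        labels=['<40', '41-50', '51-60', '61-70', '70>'])
enc = preprocessing.OrdinalEncoder()
toy['age_enc'] = enc.fit_transform(toy['age_cat'].to_numpy().reshape(-1, 1))
print(toy)
```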
%% Cell type:code id: tags:
```
## use sklearn preprocessing to fit and transform correctly
def category_age(df):
    df['age'] = pd.cut(df['age'], bins=[0, 40, 50, 60, 70, 99], labels=['<40', '41-50', '51-60', '61-70', '70>'])
    return df

def apply_encoding(X_train, X_test, column_name):
    # create and fit the label encoder
    label_encoder = preprocessing.OrdinalEncoder()
    label_encoder.fit(X_train[column_name].to_numpy().reshape(-1, 1))
    # apply to both the train and test set
    X_train[column_name] = label_encoder.transform(X_train[column_name].to_numpy().reshape(-1, 1))
    X_test[column_name] = label_encoder.transform(X_test[column_name].to_numpy().reshape(-1, 1))
    return X_train, X_test

X, X_test, y, y_test = splits_data()
print('-- original age --\n', X['age'].head())
X = category_age(X)
X_test = category_age(X_test)
print('\n-- category age --\n', X['age'].head())
X, X_test = apply_encoding(X, X_test, 'age')
print('\n-- apply encoding --\n', X['age'].head())
```
%% Output
-- original age --
 91     62
115    53
125    50
63     41
159    68
Name: age, dtype: int64

-- category age --
 91     61-70
115    51-60
125    41-50
63     41-50
159    61-70
Name: age, dtype: category
Categories (5, object): [<40 < 41-50 < 51-60 < 61-70 < 70>]

-- apply encoding --
 91     2.0
115    1.0
125    0.0
63     0.0
159    2.0
Name: age, dtype: float64
%% Cell type:markdown id: tags:
## Creating a Pipeline
To make our lives easier we will create a data pipeline: a function in which we apply all data transformations to both the train and the test data. Again, it is very important to apply the transformations to the test data correctly, to prevent data leakage. In the function *data_pipeline* we will add our data transformation functions:
%% Cell type:code id: tags:
```
def data_pipeline(model=None):
    # read in data
    X, X_test, y, y_test = splits_data()
    # create age categories
    X = category_age(X)
    X_test = category_age(X_test)
    X, X_test = apply_encoding(X, X_test, 'age')
    # Other functions to add later
    # ...
    if model is not None:
        model.fit(X, y)
        print('Performance {0}: {1:.2f}'.format(model.__class__.__name__, model.score(X_test, y_test)))
    return X, y, X_test, y_test

X, y, X_test, y_test = data_pipeline(knn)
X, y, X_test, y_test = data_pipeline(rf)
%% Output
Performance KNeighborsClassifier: 0.58
Performance RandomForestClassifier: 0.87
%% Cell type:code id: tags:
```
```
%% Cell type:markdown id: tags:
Your answer:
%% Cell type:code id: tags:
```
```
%% Cell type:markdown id: tags:
## Scaling
If you want to scale the features, you need to apply this with caution. If we apply scaling of any kind to the X data set (which we will use in the cross validation), we have to make use of scikit-learn's built-in Pipeline.
In the following bit of code we standardize (StandardScaler from sklearn) the column 'chol'.
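%% Cell type:markdown id: tags:
A minimal sketch of that built-in Pipeline (on synthetic data, since *data_clean.csv* is not part of this snippet): because the scaler lives inside the pipeline, it is re-fitted on the training part of every cross-validation fold, so no test information leaks in.
%% Cell type:code id: tags:
```
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the real data set
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# the scaler is fitted inside each CV fold, on the training part only
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier())])
scores = cross_val_score(pipe, X_demo, y_demo,
                         cv=StratifiedKFold(n_splits=5), scoring='accuracy')
print(scores.mean())
```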
%% Cell type:code id: tags:
```
from sklearn.preprocessing import StandardScaler

def data_pipeline(model=None):
    # read in data
    X, X_test, y, y_test = splits_data()
    # Create age categories
    X = category_age(X)
    X_test = category_age(X_test)
    X, X_test = apply_encoding(X, X_test, 'age')
    # Scaling on column chol: fit on train only, then transform both sets
    scale = StandardScaler()
    X['chol'] = scale.fit_transform(X['chol'].values.reshape(-1, 1))
    X_test['chol'] = scale.transform(X_test['chol'].values.reshape(-1, 1))
    # Other functions to add later
    # ...
    if model is not None: