%% Cell type:markdown id: tags:
# Introduction: training a model
In this notebook we will show you how to train a model using the data you previously preprocessed.
%% Cell type:code id: tags:
``` python
import pandas as pd # data processing
import numpy as np # linear algebra
import matplotlib.pyplot as plt # data visualisation
import seaborn as sns # data visualisation
import sklearn # machine learning
import tensorflow as tf # deep learning
from tensorflow import keras # neural networks
from tensorflow.keras import layers
# set random states
np.random.seed(42)
random_state = 42
def plot_training_hist(history):
    # summarize history for accuracy
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()
    # summarize history for loss
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()
```
%% Cell type:markdown id: tags:
## Exercise 1
**Read your previously cleaned data and split it into a train and a test set**
%% Cell type:code id: tags:
``` python
# read the data
# show the first 10 lines:
```
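%% Cell type:markdown id: tags:
A minimal sketch of what this could look like, assuming the cleaned data from the previous notebook was saved as `cleaned_data.csv` and has an outcome column called `label` (both names are placeholders, adjust them to whatever you used):
%% Cell type:code id: tags:
``` python
# a sketch: 'cleaned_data.csv' and 'label' are placeholders for your own file and outcome column
df = pd.read_csv('cleaned_data.csv')
# split into a feature matrix X and label vector y
X = df.drop(columns=['label'])
y = df['label']
df.head(10)
```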
%% Cell type:code id: tags:
``` python
# split data in X_train, X_test, y_train and y_test
from sklearn.model_selection import train_test_split
# split data in train and test (X and y come from the previous cell; a 20% test split is a common choice)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
print(f'number of features: {X_train.shape[1]}, number of training samples: {X_train.shape[0]}, number of test samples: {X_test.shape[0]}')
```
%% Cell type:markdown id: tags:
## Baseline
The baseline is the accuracy we would get if we naively predicted the most common label in the train data for every sample.
%% Cell type:code id: tags:
``` python
# find the baseline value from which you want to improve
from sklearn.metrics import accuracy_score
# find the most common label (for a binary 0/1 label, the median equals the most common class)
baseline_guess = np.median(y)
print('Baseline guess value: ', baseline_guess)
# calculate the accuracy for an array of baseline guess values (np.full(y.shape, baseline_guess))
baseline_performance = accuracy_score(y, np.full(y.shape, baseline_guess))
print('Baseline Performance = {0:.2f}'.format(baseline_performance))
```
%% Cell type:markdown id: tags:
If a model performs at about this baseline level, it is just guessing instead of really learning and predicting.
%% Cell type:markdown id: tags:
# Try some models
First we will try out some models using cross validation and the sklearn package to get a feeling for which model we want to work with.
%% Cell type:code id: tags:
``` python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC # Support Vector machine
from sklearn.neighbors import KNeighborsClassifier # K Nearest Neighbors
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score # cross validation
from sklearn.model_selection import KFold # k fold cross validation
```
%% Cell type:markdown id: tags:
## Exercise 2
Train a model on the train data and predict on the test data. Use the code shown in the slides. Do this using one of the following models: *LogisticRegression, KNeighborsClassifier, RandomForestClassifier*
**Use the X_train data to train the model and X_test data to test the performance. Which model works the best?**
%% Cell type:code id: tags:
``` python
# Your code
```
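%% Cell type:markdown id: tags:
A minimal sketch for one of the three models, assuming the train/test split from Exercise 1; the same pattern works for the other classifiers:
%% Cell type:code id: tags:
``` python
# a sketch using a random forest; swap in LogisticRegression or KNeighborsClassifier to compare
rf = RandomForestClassifier(random_state=random_state)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print('Accuracy: {0:.2f}'.format(accuracy_score(y_test, y_pred)))
```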
%% Cell type:markdown id: tags:
## Exercise 3
Let's do the same for the neural network. Play around with the different values. Add new layers, change the number of nodes etc. How well does the model perform?
%% Cell type:code id: tags:
``` python
## Change code
## define model
model = keras.Sequential()
# define the input layer
input_shape = X_train.shape[1]
model.add(keras.Input(shape = (input_shape,)))
# add a hidden layer to the model (15 nodes with relu activation)
model.add(layers.Dense(15, activation='relu'))
# add output layer (1 node with sigmoid activation)
model.add(layers.Dense(1, activation='sigmoid'))
# compile the model: specify the loss, metrics and optimizer
model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics = ['accuracy'])
## fit the model
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, verbose=0)
## print the score
score = model.evaluate(X_test, y_test)
print('Loss = ', score[0], ', Accuracy = ', score[1])
```
%% Cell type:code id: tags:
``` python
plot_training_hist(history)
```
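%% Cell type:markdown id: tags:
As an example of what "playing around" could look like, here is a sketch of a slightly deeper variant with an extra hidden layer and more nodes; the exact sizes are arbitrary starting points:
%% Cell type:code id: tags:
``` python
# a sketch of a deeper variant: two hidden layers instead of one (layer sizes are arbitrary)
model = keras.Sequential()
model.add(keras.Input(shape = (X_train.shape[1],)))
model.add(layers.Dense(30, activation='relu')) # first hidden layer, more nodes
model.add(layers.Dense(15, activation='relu')) # extra hidden layer
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, verbose=0)
plot_training_hist(history)
```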
%% Cell type:markdown id: tags:
# Cross validation
For optimization of the hyperparameters we use cross validation. This also gives you some idea of the robustness of your model. In the following lines of code we will apply this technique to the ML and DL models.
%% Cell type:markdown id: tags:
## Exercise 4
**Fill in the number of cross validations you want to use and the type of scoring. The different scoring parameters you can use can be found here: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter**
%% Cell type:code id: tags:
``` python
# You can play around with these values
n_cv = 5 # number of cross validations
score = 'accuracy' # scoring type
cv = KFold(n_cv, shuffle=True, random_state=random_state) # define the type of cross validation
```
%% Cell type:code id: tags:
``` python
# loop through all the different model types and use cross validation (defined in the previous cell) to roughly determine their performance
rf = RandomForestClassifier()
lr = LogisticRegression(max_iter=1000)
svm = SVC()
knn = KNeighborsClassifier()
models = {'random forest':rf, 'Logistic Regression':lr, 'support vector machine':svm, 'K-NN':knn}
for model in models:
    # find the scores using cross validation
    scores = cross_val_score(models[model], X_train, y_train, cv = cv, scoring = score)
    mean_score = np.mean(scores)
    std_score = np.std(scores)
    print(f'\n{model}\n---------\nscores:{[round(s,2) for s in scores]}\nMean score {mean_score:.2f} (+/- {std_score:.2f})')
```
%% Cell type:markdown id: tags:
## Question
**Which model shows the best results so far? Take the variation in scores per fold into account**
%% Cell type:markdown id: tags:
Your answer:
%% Cell type:markdown id: tags:
## Question
**Would the model benefit from more data? Look at the next plot. Change the model to compare the different models.**
%% Cell type:code id: tags:
``` python
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
from sklearn.model_selection import learning_curve
model = LogisticRegression() # change this to try the different models
train_sizes, train_scores, valid_scores = learning_curve(model, X_train, y_train, cv=cv)
plt.fill_between(train_sizes, np.mean(valid_scores, axis=1) - np.std(valid_scores, axis=1), np.mean(valid_scores, axis=1) + np.std(valid_scores, axis=1), alpha=0.3, color='g')
plt.plot(train_sizes, np.mean(valid_scores, axis=1), color='g', marker='.', label='CV score')
plt.plot(train_sizes, np.mean(train_scores, axis=1), color='r', marker='.', label='Train score')
plt.xlabel('Number of patients')
plt.ylabel('Score')
plt.legend()
```
%% Cell type:markdown id: tags:
Your answer:
%% Cell type:markdown id: tags:
# Optimising the hyperparameters
## Exercise 5
Now we will optimise the hyperparameters.
Normally, based on the performance on your data, you would choose one or two models you want to optimise. However, for illustration purposes, we will first optimise a Random Forest. We will use the easiest approach, namely *Grid Search*, to find the best hyperparameters.
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import GridSearchCV # grid search using cross validation
```
%% Cell type:code id: tags:
``` python
# define the options of hyperparameters to use for a random forest as defined in
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
parameter_grid = {
    'n_estimators': [50, 100, 150], # number of trees in the forest
    'max_depth': [5, 10, 25], # maximum depth of a tree
    'class_weight': [None, 'balanced'], # gives extra weight to the classes
}
# define the grid search
clf = GridSearchCV(RandomForestClassifier(), param_grid = parameter_grid, cv = cv, verbose=1)
# perform the grid search and fit the model on the data
clf.fit(X_train, y_train)
# show the best parameters
print('\nBest parameters: ', clf.best_params_)
print('Score: {0:.2f}'.format(clf.best_score_))
# Performance on test set
score_test = clf.score(X_test, y_test)
print('\nPerformance on test set:\nScore: {0:.2f}'.format(score_test))
```
%% Cell type:markdown id: tags:
## Question
Using the above example of code, improve the best performing ML model you found earlier. Check the sklearn documentation for the hyperparameters that can be used for your model.
%% Cell type:code id: tags:
``` python
# Your code
```
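%% Cell type:markdown id: tags:
As a hedged sketch, assuming K-NN came out as your best model (swap in your own model and the hyperparameters from its sklearn documentation page):
%% Cell type:code id: tags:
``` python
# a sketch for K-NN; the grid values are arbitrary starting points
knn_grid = {
    'n_neighbors': [3, 5, 7, 11], # number of neighbours to consider
    'weights': ['uniform', 'distance'], # how the neighbours are weighted
}
clf_knn = GridSearchCV(KNeighborsClassifier(), param_grid = knn_grid, cv = cv, verbose=1)
clf_knn.fit(X_train, y_train)
print('Best parameters: ', clf_knn.best_params_)
print('Test score: {0:.2f}'.format(clf_knn.score(X_test, y_test)))
```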
%% Cell type:markdown id: tags:
## Neural networks
The following code shows how to optimise a Neural Network. This works basically the same way, but we first have to enable our GridSearchCV to create different neural networks. For this we define the function *create_model*:
%% Cell type:code id: tags:
``` python
# using the KerasClassifier wrapper we can use the keras model the same way as a sklearn model
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier # so we can use neural networks in sklearn
# update create_model for tuning
def create_model(hidden_layers = 1, nodes = 5, activation = 'relu', optimizer = 'rmsprop'): # define the parameters and their defaults
    # define model
    model = keras.Sequential()
    # define the input layer
    input_shape = X_train.shape[1]
    model.add(keras.Input(shape = (input_shape,)))
    # add the hidden layers to the model
    for layer in range(hidden_layers):
        model.add(layers.Dense(nodes, activation = activation))
    # add output layer (1 node with sigmoid activation)
    model.add(layers.Dense(1, activation='sigmoid'))
    # compile the model: specify the loss, metrics and optimizer
    model.compile(
        optimizer = optimizer,
        loss = 'binary_crossentropy',
        metrics = ['accuracy']
    )
    return model
```
%% Cell type:markdown id: tags:
## Questions
**Can you explain the model that this function generates with its default values (e.g. if we create the model like this: `model = create_model()`)? Answer the following questions:**
**1. How many hidden layers are there?**
**2. How many nodes do the input layer, the hidden layer and the output layer have?**
**3. What are the activation functions in the hidden layer and output layer?**
**4. How many weights does the model have?**
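%% Cell type:markdown id: tags:
If you want to check your answers, Keras can list the layers and count the weights for you:
%% Cell type:code id: tags:
``` python
# build the model with its default values and inspect it
model = create_model()
model.summary() # prints the layers and the number of parameters (weights) per layer
```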
%% Cell type:markdown id: tags:
Now that we have a function to create a model, we can do a search over a parameter grid. <br>
To apply GridSearch from the sklearn package, we have to put the model in a so-called "wrapper".
%% Cell type:code id: tags:
``` python
# the parameters over which to search the grid. Note that these are the same as the create_model arguments
parameter_grid = {
    'hidden_layers': [1, 3, 5],
    'nodes': [5, 10, 15],
    'activation': ['relu', 'sigmoid']
}
# Define the number of epochs and batch size
n_epochs = 50 # number of epochs
batch_size = 8 # batch size
cv = 3 # use plain 3-fold cross validation here to keep the search fast
# Use the wrapper
model = KerasClassifier(
    build_fn=create_model,
    epochs = n_epochs,
    batch_size = batch_size,
    verbose = 0,
)
# define the grid search
clf = GridSearchCV(model, param_grid = parameter_grid, cv = cv, verbose=1, refit=True)
# perform the grid search and fit the model on the training data
clf.fit(X_train, y_train)
# show the best parameters
print('\nBest parameters: ', clf.best_params_)
print('Score: {0:.2f}'.format(clf.best_score_))
# Performance on test set
score_test = clf.score(X_test, y_test)
print('\nPerformance on test set:\nScore: {0:.2f}'.format(score_test))
```
%% Cell type:markdown id: tags:
## Question
**Try to improve the values for your DL model**
- Include different values for the hyperparameters
- Replace GridSearchCV with another form of search (tip: use BayesSearchCV from https://scikit-optimize.github.io/stable/auto_examples/sklearn-gridsearchcv-replacement.html). Install the package using the following code:
%% Cell type:code id: tags:
``` python
# install package to use BayesSearchCV
! pip install scikit-optimize
```
%% Cell type:code id: tags:
``` python
# Your code
```
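%% Cell type:markdown id: tags:
A hedged sketch of what this could look like with `BayesSearchCV`, assuming scikit-optimize is installed and `model` is the KerasClassifier wrapper from above; the search spaces mirror the parameter grid, and `n_iter` controls how many parameter settings are tried:
%% Cell type:code id: tags:
``` python
# a sketch using BayesSearchCV instead of GridSearchCV; n_iter=16 is an arbitrary budget
from skopt import BayesSearchCV
from skopt.space import Integer, Categorical

search_spaces = {
    'hidden_layers': Integer(1, 5),
    'nodes': Integer(5, 20),
    'activation': Categorical(['relu', 'sigmoid']),
}
opt = BayesSearchCV(model, search_spaces, n_iter=16, cv=3, verbose=1)
opt.fit(X_train, y_train)
print('Best parameters: ', opt.best_params_)
print('Test score: {0:.2f}'.format(opt.score(X_test, y_test)))
```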
%% Cell type:markdown id: tags:
## Final task
Save your model so you can validate its performance next week. First run the cell with the pipeline which gave you the best results, then run the following cell:
%% Cell type:code id: tags:
``` python
import pickle
pickle.dump(clf.best_estimator_, open('model.pkl', 'wb'))
```
%% Cell type:code id: tags:
``` python
# check if it worked
m = pickle.load(open('model.pkl', 'rb'))
```
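%% Cell type:markdown id: tags:
To be sure the round trip worked, you can compare the loaded model's score with the one you found before saving (assuming your best estimator follows the sklearn API):
%% Cell type:code id: tags:
``` python
# the loaded model should give the same test score as before saving
print('Test score of the loaded model: {0:.2f}'.format(m.score(X_test, y_test)))
```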