Commit 5181fb65 by Stoop

### Merge branch 'ljschinkelshoek-master-patch-45567' into 'master'

```
Update 2.3_models_and_hyperparameters.ipynb

See merge request !23
```
parents 6a6b3e8f 22cb28e4
%% Cell type:markdown id: tags:

# Introduction: training a model

In this notebook we will show you how to train a model using the data you previously preprocessed.

%% Cell type:code id: tags:

``` python
import pandas as pd               # data processing
import numpy as np                # linear algebra
import matplotlib.pyplot as plt   # data visualisation
import seaborn as sns             # data visualisation
import sklearn                    # machine learning
import tensorflow as tf           # deep learning
from tensorflow import keras      # neural networks
from tensorflow.keras import layers

# set random states
np.random.seed(42)
random_state = 42

def plot_training_hist(history):
    # summarize history for accuracy
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()
    # summarize history for loss
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()
```

%% Cell type:markdown id: tags:

## Exercise 1

**Read your previously cleaned data and split the data between a train and a test set.**

%% Cell type:code id: tags:

``` python
# read the data

# show the first 10 lines:
```

%% Cell type:code id: tags:

``` python
# split data in X_train, X_test, y_train and y_test
from sklearn.model_selection import train_test_split

# split data in train and test

print(f'number of features: {X_train.shape[1]}, number of training samples: {X_train.shape[0]}, number of test samples: {X_test.shape[0]}')
```

%% Cell type:markdown id: tags:

## Baseline

The baseline is the accuracy we would get if we naively predicted the most common label in the train data for every sample.
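%% Cell type:markdown id: tags:

To make the baseline idea concrete, here is a minimal sketch on a toy label vector (the variable names are illustrative, not from the notebook): predicting the majority class for every sample gives the accuracy any real model has to beat.

%% Cell type:code id: tags:

``` python
import numpy as np

# toy binary labels; in the notebook this would be your y vector
y_toy = np.array([0, 1, 1, 1, 0, 1, 1, 0])

# count how often each label occurs; the most common label is the majority class
counts = np.bincount(y_toy)
majority = counts.argmax()

# predicting the majority class for every sample gives the baseline accuracy
baseline_acc = counts[majority] / len(y_toy)
print(majority, baseline_acc)  # 1 0.625
```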
%% Cell type:code id: tags:

``` python
# find the baseline value from which you want to improve
from sklearn.metrics import accuracy_score

# find the most common label (for binary 0/1 labels the median equals the majority class)
baseline_guess = np.median(y)
print('Baseline guess value: ', baseline_guess)

# calculate the accuracy for an array of baseline guess values (np.full(y.shape, baseline_guess))
baseline_performance = accuracy_score(np.full(y.shape, baseline_guess), y)
print('Baseline Performance = {0:.2f}'.format(baseline_performance))
```

%% Cell type:markdown id: tags:

If a model performs at about this baseline level, it is just guessing instead of really learning and predicting.

%% Cell type:markdown id: tags:

# Try some models

First we will try out some models using cross validation and the sklearn package to get a feeling for which model we want to work with.

%% Cell type:code id: tags:

``` python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC                          # support vector machine
from sklearn.neighbors import KNeighborsClassifier   # k nearest neighbours
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score  # cross validation
from sklearn.model_selection import KFold            # k-fold cross validation
```

%% Cell type:markdown id: tags:

## Exercise 2

Train a model on the train data and predict on the test data. Use the code shown in the slides. Do this using one of the following models: *LogisticRegression, KNeighborsClassifier, RandomForestClassifier*.

**Use the X_train data to train the model and X_test data to test the performance. Which model works best?**

%% Cell type:code id: tags:

``` python
# Your code
```

%% Cell type:markdown id: tags:

## Exercise 3

Let's do the same for the neural network. Play around with the different values: add new layers, change the number of nodes, etc. How well does the model perform?
%% Cell type:code id: tags:

``` python
## Change code

## define model
model = keras.Sequential()

# define the input layer
input_shape = X_train.shape[1]
model.add(keras.Input(shape = input_shape))

# add a hidden layer to the model (15 nodes with relu activation)
model.add(layers.Dense(15, activation='relu'))

# add output layer (1 node with sigmoid activation)
model.add(layers.Dense(1, activation='sigmoid'))

# compile the model: specify the loss, metrics and optimizer
model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics = ['accuracy'])

## fit the model
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, verbose=0)

## print the score
score = model.evaluate(X_test, y_test)
print('Loss = ', score[0], ', Accuracy = ', score[1])
```

%% Cell type:code id: tags:

``` python
plot_training_hist(history)
```

%% Cell type:markdown id: tags:

# Cross validation

For optimisation of the hyperparameters we use cross validation. This also gives you some idea of the robustness of your model. In the following lines of code we will apply this technique to the ML and DL models.

%% Cell type:markdown id: tags:

## Exercise 4

**Fill in the number of cross validations you want to use and the type of scoring. The different scoring parameters you can use can be found here: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter**

%% Cell type:code id: tags:

``` python
# You can play around with these values
n_cv = 5            # number of cross validations
score = 'accuracy'  # scoring type
cv = KFold(n_cv, shuffle=True, random_state=random_state)  # define the type of cross validation
```

%% Cell type:code id: tags:

``` python
# loop through all the different model types and use cross validation (defined in the previous cell) to roughly determine their performance
rf = RandomForestClassifier()
lr = LogisticRegression(max_iter=1000)
svm = SVC()
knn = KNeighborsClassifier()
models = {'random forest': rf, 'logistic regression': lr, 'support vector machine': svm, 'K-NN': knn}

for model in models:
    # find the scores using cross validation
    scores = cross_val_score(models[model], X_train, y_train, cv = cv, scoring = score)
    mean_score = np.mean(scores)
    std_score = np.std(scores)
    print(f'\n{model}\n---------\nscores: {[round(s, 2) for s in scores]}\nMean score {mean_score:.2f} (+/- {std_score:.2f})')
```

%% Cell type:markdown id: tags:

## Question

**Which model shows the best results so far? Take the variation in scores per fold into account.**

%% Cell type:markdown id: tags:

Your answer:

%% Cell type:markdown id: tags:

## Question

**Would the model benefit from more data? Look into the next plot. Change the model to look into the different models.**

%% Cell type:code id: tags:

``` python
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)

from sklearn.model_selection import learning_curve

model = LogisticRegression()  # change this value

train_sizes, train_scores, valid_scores = learning_curve(model, X_train, y_train, cv=cv)

plt.fill_between(train_sizes,
                 np.mean(valid_scores, axis=1) - np.std(valid_scores, axis=1),
                 np.mean(valid_scores, axis=1) + np.std(valid_scores, axis=1),
                 alpha=0.3, color='g')
plt.plot(train_sizes, np.mean(valid_scores, axis=1), color='g', marker='.', label='CV score')
plt.plot(train_sizes, np.mean(train_scores, axis=1), color='r', marker='.', label='Train score')
plt.xlabel('Number of patients')
plt.ylabel('Score')
plt.legend()
```

%% Cell type:markdown id: tags:

Your answer:

%% Cell type:markdown id: tags:

# Optimising the hyperparameters

## Exercise 5

Now we will optimise the hyperparameters. Normally, based on the performance so far, you would choose one or two models to optimise. However, for illustration purposes, we will first optimise a Random Forest. We will use the easiest approach, *Grid Search*, to find the best hyperparameters.
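%% Cell type:markdown id: tags:

Grid search simply trains and scores a model for every combination in the parameter grid. sklearn's `ParameterGrid` makes the combinations explicit; a small sketch (the grid values here are illustrative):

%% Cell type:code id: tags:

``` python
from sklearn.model_selection import ParameterGrid

# a toy grid: 3 values x 3 values x 2 values = 18 combinations
grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 25],
    'class_weight': [None, 'balanced'],
}

# each combination is one candidate model, fitted once per CV fold
combos = list(ParameterGrid(grid))
print(len(combos))  # 18
```

With 5-fold cross validation this means 18 × 5 = 90 fits, which is what GridSearchCV reports in its verbose output.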
%% Cell type:code id: tags:

``` python
from sklearn.model_selection import GridSearchCV  # grid search using cross validation
```

%% Cell type:code id: tags:

``` python
# define the options of hyperparameters to use for a random forest as defined in
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
parameter_grid = {
    'n_estimators': [50, 100, 150],      # number of trees in the forest
    'max_depth': [5, 10, 25],            # maximum depth of a tree
    'class_weight': [None, 'balanced'],  # gives extra weight to the classes
}

# define the grid search
clf = GridSearchCV(RandomForestClassifier(), param_grid = parameter_grid, cv = cv, verbose=1)

# perform the grid search and fit the model on the data
clf.fit(X_train, y_train)

# show the best parameters
print('\nBest parameters: ', clf.best_params_)
print('Score: {0:.2f}'.format(clf.best_score_))

# performance on the test set
score_test = clf.score(X_test, y_test)
print('\nPerformance on test set:\nScore: {0:.2f}'.format(score_test))
```

%% Cell type:markdown id: tags:

## Question

Using the above example of code, improve the best performing ML model you found earlier. Check the sklearn documentation for the hyperparameters that can be used for your model.

%% Cell type:code id: tags:

``` python
# Your code
```

%% Cell type:markdown id: tags:

## Neural networks

The following code shows how to optimise a neural network. This works basically the same way, but we first have to enable our GridSearchCV to create different neural networks.
For this we define the function *create_model*:

%% Cell type:code id: tags:

``` python
# using the KerasClassifier wrapper we can use the keras model the same way as a sklearn model
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier  # so we can use neural networks in sklearn

# update create_model for tuning
def create_model(hidden_layers = 1, nodes = 5, activation = 'relu', optimizer = 'rmsprop'):
    # the keyword arguments above are the tunable parameters and their defaults

    # define model
    model = keras.Sequential()

    # define the input layer
    input_shape = X.shape[1]
    model.add(keras.Input(shape = input_shape))

    # add hidden layers to the model
    for layer in range(hidden_layers):
        model.add(layers.Dense(nodes, activation = activation))

    # add output layer (1 node with sigmoid activation)
    model.add(layers.Dense(1, activation='sigmoid'))

    # compile the model: specify the loss, metrics and optimizer
    model.compile(
        optimizer = optimizer,
        loss = 'binary_crossentropy',
        metrics = ['accuracy']
    )
    return model
```

%% Cell type:markdown id: tags:

## Questions

**Can you explain the model that this function generates with its default values (e.g. if we create the model like this: model = create_model())? Answer the following questions:**

**1. How many hidden layers are there?**

**2. How many nodes do the input layer, the hidden layer and the output layer have?**

**3. What are the activation functions in the hidden layer and output layer?**

**4. How many weights does the model have?**

%% Cell type:markdown id: tags:

Now that we have a function to create a model we can do a search over a parameter grid.
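%% Cell type:markdown id: tags:

The weight count of a dense network (question 4 above) can be checked by hand: each Dense layer has `(inputs + 1) * nodes` parameters, where the `+ 1` is the bias. A sketch assuming, purely for illustration, 10 input features (in the notebook the input size is `X.shape[1]`):

%% Cell type:code id: tags:

``` python
# illustrative input size; in the notebook this is X.shape[1]
n_features = 10
hidden_nodes = 5  # the default `nodes` value of create_model

# (inputs + 1) * nodes parameters per dense layer; the +1 is the bias term
hidden = (n_features + 1) * hidden_nodes  # weights into the hidden layer
output = (hidden_nodes + 1) * 1           # weights into the single output node
print(hidden + output)  # 61
```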
To use the sklearn package to apply GridSearch we have to put the model in a so-called "wrapper".

%% Cell type:code id: tags:

``` python
# the parameters on which to search the grid; note that these are the same as the create_model arguments
parameter_grid = {
    'hidden_layers': [1, 3, 5],
    'nodes': [5, 10, 15],
    'activation': ['relu', 'sigmoid']
}

# define the number of epochs and batch size
n_epochs = 50    # number of epochs
batch_size = 8   # batch size
cv = 3

# use the wrapper
model = KerasClassifier(
    build_fn=create_model,
    epochs = n_epochs,
    batch_size=batch_size,
    verbose=0,
)

# define the grid search
clf = GridSearchCV(model, param_grid = parameter_grid, cv = cv, verbose=1, refit=True)

# perform the grid search and fit the model on the data
clf.fit(X, y)

# show the best parameters
print('\nBest parameters: ', clf.best_params_)
print('Score: {0:.2f}'.format(clf.best_score_))

# performance on the test set
score_test = clf.score(X_test, y_test)
print('\nPerformance on test set:\nScore: {0:.2f}'.format(score_test))
```

%% Cell type:markdown id: tags:

## Question

**Try to improve the values for your DL model**

- Include different values for the hyperparameters
- Replace GridSearchCV with another form of search (tip: use BayesSearchCV from https://scikit-optimize.github.io/stable/auto_examples/sklearn-gridsearchcv-replacement.html)

Install the package using the following code:

%% Cell type:code id: tags:

``` python
# install package to use BayesSearchCV
! pip install scikit-optimize
```

%% Cell type:code id: tags:

``` python
# Your code
```

%% Cell type:markdown id: tags:

## Final task

Save your model so you can validate its performance next week. First run the cell with the pipeline which gave you the best results, then run the following cell:

%% Cell type:code id: tags:

``` python
import pickle
pickle.dump(clf.best_estimator_, open('model.pkl', 'wb'))
```

%% Cell type:code id: tags:

``` python
# check if it worked
m = pickle.load(open('model.pkl', 'rb'))
```
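%% Cell type:markdown id: tags:

To convince yourself that pickling preserves the fitted model, here is a minimal round-trip sketch on a toy classifier (the file name `model_check.pkl` and the toy data are illustrative, not part of the assignment):

%% Cell type:code id: tags:

``` python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# fit a toy model
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
clf_toy = LogisticRegression().fit(X_toy, y_toy)

# save the fitted model and reload it
with open('model_check.pkl', 'wb') as f:
    pickle.dump(clf_toy, f)
with open('model_check.pkl', 'rb') as f:
    restored = pickle.load(f)

# the restored model makes identical predictions
print((restored.predict(X_toy) == clf_toy.predict(X_toy)).all())  # True
```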