Commit e79202df authored by Stoop

Merge branch 'ljschinkelshoek-master-patch-71741' into 'master'

Update 2.1_splitting.ipynb

See merge request !22
parents 5181fb65 b2a5bd50
%% Cell type:markdown id: tags:
# Introduction to splitting your data
In this notebook we introduce splitting your data and fitting a model, using the penguins dataset. We will split the data into different train/test sizes to see the effect on the model's performance.
%% Cell type:code id: tags:
``` python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

def plot_train_result(iterations,train_scores,test_scores,train_loss,test_loss):
    """Plot accuracy (left) and loss (right) for train and test against the iterations."""
    fig, (ax1,ax2) = plt.subplots(1,2,figsize=(11, 5))
    ax1.plot(iterations,train_scores,label='train')
    ax1.plot(iterations,test_scores,label='test')
    ax1.set_title('Accuracy')
    ax2.plot(iterations,train_loss,label='train')
    ax2.plot(iterations,test_loss,label='test')
    ax2.set_title('Loss')
    ax1.legend()
    ax2.legend()
    plt.show()
```
%% Cell type:code id: tags:
``` python
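# Load the data and impute the missing values.
from sklearn.impute import SimpleImputer
data = sns.load_dataset('penguins')
# 'most_frequent' works for numeric and string columns alike;
# 'mean' and 'median' only work on numeric columns.
imputer = SimpleImputer(strategy='most_frequent')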
data.iloc[:,:] = imputer.fit_transform(data)
```
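%% Cell type:markdown id: tags:
To see what the imputer does, here is a minimal sketch on a made-up frame. The `toy` DataFrame, its values and the `toy_` names are assumptions for illustration only and are not part of the penguins data. Each missing entry is replaced by the most frequent value in its column.
%% Cell type:code id: tags:
```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one numeric and one string column
toy = pd.DataFrame({
    'mass': [3800.0, np.nan, 3800.0, 3250.0],
    'island': ['Dream', 'Dream', np.nan, 'Biscoe'],
})

toy_imputer = SimpleImputer(strategy='most_frequent')
# fit_transform returns a NumPy array, so we wrap it in a DataFrame again
toy_filled = pd.DataFrame(toy_imputer.fit_transform(toy), columns=toy.columns)
print(toy_filled)
```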
%% Cell type:markdown id: tags:
In this exercise we will only use the numerical columns, so we define a list of feature columns and a list for the label:
%% Cell type:code id: tags:
``` python
features = ['bill_length_mm','bill_depth_mm','flipper_length_mm','body_mass_g']
label = ['species']
data[features+label]
```
%%%% Output: execute_result
     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g species
0              39.1           18.7              181.0       3750.0  Adelie
1              39.5           17.4              186.0       3800.0  Adelie
2              40.3           18.0              195.0       3250.0  Adelie
3              41.1           17.0              190.0       3800.0  Adelie
4              36.7           19.3              193.0       3450.0  Adelie
..              ...            ...                ...          ...     ...
339            41.1           17.0              190.0       3800.0  Gentoo
340            46.8           14.3              215.0       4850.0  Gentoo
341            50.4           15.7              222.0       5750.0  Gentoo
342            45.2           14.8              212.0       5200.0  Gentoo
343            49.9           16.1              213.0       5400.0  Gentoo

[344 rows x 5 columns]
%% Cell type:code id: tags:
``` python
# We will deliberately stop the logistic regression early (underfit it) to see the effect.
# sklearn then raises a ConvergenceWarning because the solver has not converged; we silence that warning here:
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
```
%% Cell type:markdown id: tags:
## Exercise 1
As a first exercise we will classify the Adelie penguins based on each penguin's bill length, bill depth, flipper length and body mass. We call X the data and y the labels (which we need as numpy arrays for this logistic regressor):
%% Cell type:code id: tags:
``` python
X=data[features].values
# True for Adelie penguins, False for all other species
y=(data[label]=='Adelie').values
```
%% Cell type:markdown id: tags:
Read the documentation for sklearn's train_test_split and use the function to split the data into 33% test and 67% training data: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split
# Your code here: call train_test_split to create X_train, X_test, y_train and y_test
print('training data shape is:',X_train.shape, 'the training labels:',y_train.shape)
print('test data shape is:',X_test.shape,'the test labels',y_test.shape)
```
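%% Cell type:markdown id: tags:
If the call signature is unclear, the sketch below shows one way to use `train_test_split` on stand-in data. The arrays `X_demo` and `y_demo`, the `_tr`/`_te` names and the choice of `random_state=42` are assumptions for illustration; in the exercise you would pass `X` and `y` instead.
%% Cell type:code id: tags:
```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples with 4 features and a boolean label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo[:, 0] > 0

# test_size=0.33 holds out about 33% of the rows for testing;
# random_state fixes the shuffle so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)
print('train:', X_tr.shape, y_tr.shape)
print('test: ', X_te.shape, y_te.shape)
```
%% Cell type:markdown id: tags:
Without `random_state` the rows are shuffled differently on every run, so the scores you see later can vary slightly between runs.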
%% Cell type:markdown id: tags:
For classification, log loss (cross-entropy) is a standard loss function, and in this instance we will also track our model's accuracy to assess its performance.
The code below trains a logistic regression classifier for at most 'i' iterations. By letting the number of iterations increase in the for loop we allow the model more and more training time. Finally we plot the training history to see how the accuracy and loss changed over time.
%% Cell type:code id: tags:
``` python
from sklearn.metrics import accuracy_score,log_loss
number_of_iterations=20
train_scores,test_scores,train_loss,test_loss=[],[],[],[]
iterations=[i for i in range(number_of_iterations)]
for i in iterations:
    # a larger max_iter gives the solver more training time
    model = LogisticRegression(max_iter=i)
    model = model.fit(X_train,y_train.ravel())
    y_train_pred=model.predict(X_train)
    y_test_pred=model.predict(X_test)
    train_scores.append( accuracy_score(y_train,y_train_pred) )
    test_scores.append( accuracy_score(y_test,y_test_pred) )
    # log_loss expects predicted probabilities rather than hard class labels
    train_loss.append( log_loss(y_train.ravel(),model.predict_proba(X_train)) )
    test_loss.append( log_loss(y_test.ravel(),model.predict_proba(X_test)) )
```
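%% Cell type:markdown id: tags:
Log loss is defined on predicted probabilities, so it rewards confident correct predictions and punishes confident wrong ones. The sketch below uses made-up probabilities (`p_pos` and the labels are assumptions, not penguin results) to compare the loss computed from probabilities with the loss computed from hard 0/1 predictions.
%% Cell type:code id: tags:
```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([True, False, True, True])
# Hypothetical predicted probabilities of the positive class
p_pos = np.array([0.9, 0.2, 0.6, 0.4])
# Hard predictions at a 0.5 threshold, as model.predict would give for a binary model
hard_labels = (p_pos >= 0.5).astype(float)

loss_from_proba = log_loss(y_true, p_pos)
loss_from_labels = log_loss(y_true, hard_labels)
print('log loss from probabilities:', loss_from_proba)
print('log loss from hard labels:  ', loss_from_labels)
```
%% Cell type:markdown id: tags:
With hard labels every mistake counts as a fully confident error, so the loss mostly mirrors the accuracy and carries little extra information.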
%% Cell type:code id: tags:
``` python
plot_train_result(iterations,train_scores,test_scores,train_loss,test_loss)
```
%%%% Output: display_data
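![Two line plots titled Accuracy and Loss, each showing a train and a test curve against the number of iterations]()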
%% Cell type:markdown id: tags:
## Exercise 2
By trying out different configurations and plotting the results, find a good balance between train and test size and a good number of iterations to let the model train.
What did you learn?
%% Cell type:raw id: tags:
#Your answer here
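%% Cell type:markdown id: tags:
One possible way to organise such an experiment is to loop over several test sizes and record the train and test accuracy for each. This is only a sketch: `X_demo` and `y_demo` are synthetic stand-ins (use `X` and `y` in the notebook), and the list of test sizes, `max_iter=100` and `random_state=0` are arbitrary choices.
%% Cell type:code id: tags:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; in the notebook you would use X and y instead
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo[:, 0] + X_demo[:, 1] > 0

results = {}
for test_size in [0.1, 0.3, 0.5, 0.7, 0.9]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=0)
    model = LogisticRegression(max_iter=100).fit(X_tr, y_tr)
    # store (train accuracy, test accuracy) per test size
    results[test_size] = (
        accuracy_score(y_tr, model.predict(X_tr)),
        accuracy_score(y_te, model.predict(X_te)),
    )

for test_size, (train_acc, test_acc) in results.items():
    print(f'test_size={test_size:.1f}  train={train_acc:.3f}  test={test_acc:.3f}')
```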
%% Cell type:markdown id: tags:
## Exercise 3
Now let's try the same with a regression problem: let's predict the body mass of a penguin based on its bill length, bill depth and flipper length.
%% Cell type:code id: tags:
``` python
features = ['bill_length_mm','bill_depth_mm','flipper_length_mm']
label = ['body_mass_g']
X=data[features].values
y=data[label].values
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.3)
```
%% Cell type:code id: tags:
``` python
from sklearn.metrics import accuracy_score,mean_squared_error
train_scores=[]
test_scores=[]
train_loss=[]
test_loss=[]
iterations=[i for i in range(15)]
for i in iterations:
    model = LogisticRegression(max_iter=i)
    model = model.fit(X_train,y_train.ravel())
    y_train_pred=model.predict(X_train)
    y_test_pred=model.predict(X_test)
    train_scores.append( accuracy_score(y_train,y_train_pred) )
    test_scores.append( accuracy_score(y_test,y_test_pred) )
    train_loss.append( mean_squared_error(y_train,y_train_pred) )
    test_loss.append( mean_squared_error(y_test,y_test_pred) )
```
%% Cell type:code id: tags:
``` python
plot_train_result(iterations,train_scores,test_scores,train_loss,test_loss)
```
%%%% Output: display_data
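![Two line plots titled Accuracy and Loss, each showing a train and a test curve against the number of iterations]()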
%% Cell type:markdown id: tags:
Again: play around with different train/test sizes and number of iterations. <br>
Question: What does the accuracy tell you in this example? (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score)
%% Cell type:raw id: tags:
#Your answer
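%% Cell type:markdown id: tags:
As a hint for thinking about this question, the sketch below compares `accuracy_score` and `mean_squared_error` on a few made-up body masses. The values in `y_true_demo` and `y_pred_demo` are assumptions for illustration and not model output.
%% Cell type:code id: tags:
```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical body masses in grams and hypothetical predictions
y_true_demo = np.array([3750.0, 3800.0, 3250.0, 3450.0])
y_pred_demo = np.array([3700.0, 3800.0, 3300.0, 3500.0])

# accuracy_score only counts predictions that match the target exactly
acc = accuracy_score(y_true_demo, y_pred_demo)
# mean_squared_error measures how far off the predictions are on average
mse = mean_squared_error(y_true_demo, y_pred_demo)
print('accuracy:', acc)
print('mean squared error:', mse)
```
%% Cell type:markdown id: tags:
A prediction that is off by 50 grams counts as just as wrong for the accuracy as one that is off by 2000 grams, while the mean squared error does distinguish between them.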
%% Cell type:markdown id: tags:
Question: How does this problem compare to the classification problem? What differences do you observe in the training history?
%% Cell type:raw id: tags:
#Your answer
%% Cell type:code id: tags:
``` python
```