In this notebook we will introduce splitting your data and fitting a model. We will use the penguins dataset and split it into different train/test sizes to see the effect on the model's performance.
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g species
0 39.1 18.7 181.0 3750.0 Adelie
1 39.5 17.4 186.0 3800.0 Adelie
2 40.3 18.0 195.0 3250.0 Adelie
3 41.1 17.0 190.0 3800.0 Adelie
4 36.7 19.3 193.0 3450.0 Adelie
.. ... ... ... ... ...
339 41.1 17.0 190.0 3800.0 Gentoo
340 46.8 14.3 215.0 4850.0 Gentoo
341 50.4 15.7 222.0 5750.0 Gentoo
342 45.2 14.8 212.0 5200.0 Gentoo
343 49.9 16.1 213.0 5400.0 Gentoo
[344 rows x 5 columns]
%% Cell type:code id: tags:
``` python
# We will deliberately under-train the logistic regression to see the effect.
# sklearn raises a ConvergenceWarning when the solver stops before converging;
# to keep the output clean, we tell it to ignore those warnings:
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings('ignore', category=ConvergenceWarning)
```

%% Cell type:markdown id: tags:
As a first exercise we will classify the Adelie penguins based on the penguins' bill length, bill depth, flipper length and body mass. We call X the data and y the labels (which we need as NumPy arrays for this logistic regressor):
%% Cell type:code id: tags:
``` python
X = data[features].values
y = (data[label] == 'Adelie').values
```
%% Cell type:markdown id: tags:
Read the documentation for sklearn's train_test_split and use the function to split the data into 33% test and 67% training data: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print('training data shape is:', X_train.shape, 'the training labels:', y_train.shape)
print('test data shape is:', X_test.shape, 'the test labels:', y_test.shape)
```
%% Cell type:markdown id: tags:
For classification, the log loss is a standard loss function to use; in this instance we will also track our model's accuracy to assess its performance.
The code below trains a logistic regression classifier for at most 'i' iterations. By letting the number of iterations increase in the for loop, we allow the model more and more training time. Finally we plot the training history to see how the accuracy and loss changed over time.
Try out different configurations and plot the results: find a good balance between train and test size, and a good number of iterations to let the model train.
What did you learn?
%% Cell type:raw id: tags:
#Your answer here
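%% Cell type:markdown id: tags:
The loop described above can be sketched as follows. This is a minimal sketch, not the notebook's exact code: it generates a synthetic stand-in for the penguin arrays with `make_classification` so the cell runs on its own, uses an illustrative list of iteration counts, and prints the history instead of plotting it.

%% Cell type:code id: tags:
``` python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

warnings.filterwarnings('ignore', category=ConvergenceWarning)

# Synthetic stand-in for the penguin features and binary labels.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

history = {'iters': [], 'train_acc': [], 'test_acc': [], 'train_loss': [], 'test_loss': []}
for i in [1, 2, 5, 10, 20, 50, 100]:
    # Train a fresh model limited to at most i solver iterations.
    clf = LogisticRegression(max_iter=i)
    clf.fit(X_train, y_train)
    history['iters'].append(i)
    history['train_acc'].append(accuracy_score(y_train, clf.predict(X_train)))
    history['test_acc'].append(accuracy_score(y_test, clf.predict(X_test)))
    history['train_loss'].append(log_loss(y_train, clf.predict_proba(X_train)))
    history['test_loss'].append(log_loss(y_test, clf.predict_proba(X_test)))

print('final test accuracy:', history['test_acc'][-1])
print('train loss over iterations:', [round(l, 3) for l in history['train_loss']])
```

With more iterations the training loss should shrink; comparing it against the test loss in the plot is what reveals under- or over-fitting.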
%% Cell type:markdown id: tags:
## Exercise 3
Now let's try the same with a regression problem: let's predict the body mass of a penguin based on its bill and flipper length and bill depth.
Again: play around with different train/test sizes and numbers of iterations. <br>
Question: What does the accuracy tell you in this example? (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score)
%% Cell type:raw id: tags:
#Your answer
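%% Cell type:markdown id: tags:
A regression version of the same experiment can be sketched like this. Again this is only a sketch: it uses synthetic stand-in data from `make_regression` rather than the penguin measurements, an illustrative list of iteration counts, and tracks mean squared error as the performance metric.

%% Cell type:code id: tags:
``` python
import warnings
from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

warnings.filterwarnings('ignore', category=ConvergenceWarning)

# Synthetic stand-in for bill/flipper measurements -> body mass.
X, y = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

test_mse = []
for i in [1, 2, 5, 10, 20, 50, 100]:
    # Train a fresh regressor limited to at most i passes over the data.
    reg = SGDRegressor(max_iter=i, random_state=0)
    reg.fit(X_train, y_train)
    test_mse.append(mean_squared_error(y_test, reg.predict(X_test)))

print('test MSE over increasing iterations:', [round(m, 1) for m in test_mse])
```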
%% Cell type:markdown id: tags:
Question: How does this problem compare to the classification task? What differences do you observe in the training history?