Commit bbfbdd3e authored by Stoop

Merge branch 'flspijkerboer-master-patch-05488' into 'master'

Upload EDA practicum

See merge request !11
parents e8c03e6b 3536c5b1
%% Cell type:code id: tags:
``` python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
```
%% Cell type:markdown id: tags:
# Exploratory Data Analysis: Palmer Penguins
![](https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/lter_penguins.png)
## About the data
Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
Source: @allison_horst https://github.com/allisonhorst/penguins
%% Cell type:markdown id: tags:
## 1. Load data
For more information on how to load data, see https://www.analyticsvidhya.com/blog/2020/04/how-to-read-common-file-formats-python/
Data can originate from many sources. The dataset used in this notebook is available through the Python package `seaborn`, which fetches it from its online example-data repository. A dataset loaded through seaborn is a pandas DataFrame and can be used as such. pandas is a powerful library for data wrangling.
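As a minimal sketch of another common source, the snippet below reads CSV text into a DataFrame with `pd.read_csv`. The CSV content and column names are made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path or URL.
csv_text = "name,length_mm,mass_g\na,39.1,3750\nb,46.5,\nc,50.0,5400\n"

# read_csv accepts any file-like object; empty fields become NaN.
toy = pd.read_csv(io.StringIO(csv_text))
print(type(toy).__name__)
print(toy)
```

Whatever the source, the result is a regular DataFrame, so the rest of the notebook would work the same way.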
%% Cell type:code id: tags:
``` python
data_raw = sns.load_dataset('penguins')
data = data_raw.copy()  # make a copy of the original dataset
data.head(10)
```
%% Cell type:markdown id: tags:
### Task 1
Look at the first ten entries of your dataset. What stands out?
%% Cell type:raw id: tags:
your answer:
%% Cell type:markdown id: tags:
## 2. Understanding your data
We would like to know more about our data. We will use different Exploratory Data Analysis tools to do so. The questions we take into account to get a better understanding of our data are:
- What kind of data do we have and how was it gathered?
- How much data do we have?
- Is there missing data and why is it missing?
- Is the data intuitive and comparable to domain knowledge?
- ...
### Task 2.1
Use the pandas attribute `.shape` and the method `.info()` to get insight into the amount of data you have. Write your code in the cell below. Subsequently, try to describe your data and answer the following questions: What kind of data do we have? How many entries? What features do we have?
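The sketch below shows how the two are called on a small made-up DataFrame (the `toy` frame and its columns are hypothetical, not the penguins data). Note that `.shape` is an attribute while `.info()` is a method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the penguins dataset.
toy = pd.DataFrame({
    "group": ["a", "b", "b", "a"],
    "length_mm": [39.1, 46.5, np.nan, 40.2],
})

# .shape is an attribute (no parentheses): (number of rows, number of columns).
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# .info() is a method: it prints column names, non-null counts and dtypes.
toy.info()
```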
%% Cell type:code id: tags:
``` python
#your code
```
%% Cell type:raw id: tags:
Describe your dataset:
%% Cell type:markdown id: tags:
### Task 2.2
Look at the data types (Dtype) reported for your columns and compare them to your output. What is the difference between the Dtypes `object` and `float`? For an overview of possible data types, see https://pbpython.com/pandas_dtypes.html
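To see how dtypes behave, here is a small sketch on made-up data (the `toy` frame is hypothetical). Depending on the pandas version, a text column may be reported as `object` or as a dedicated string dtype, so the check below only asks whether a column is numeric.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_numeric_dtype

# Hypothetical toy data: one text column and one decimal-number column.
toy = pd.DataFrame({
    "group": ["a", "b", "b"],
    "length_mm": [39.1, 46.5, 40.2],
})

print(toy.dtypes)

# Float columns support arithmetic such as a mean; text columns do not.
length_is_float = is_float_dtype(toy["length_mm"])
group_is_numeric = is_numeric_dtype(toy["group"])
print(length_is_float, group_is_numeric)
print(toy["length_mm"].mean())
```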
%% Cell type:raw id: tags:
your answer:
%% Cell type:markdown id: tags:
### Task 2.3
Do you understand what all columns represent? What is the bill length of a penguin? Make sure you understand all of your columns and know in what range their values should normally fall. Continue once this is clear to you.
%% Cell type:markdown id: tags:
![](https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png)
%% Cell type:markdown id: tags:
### Task 2.4
We then want to check if all values are in the expected range. You can use the `.describe` function to get the descriptive statistics of all numerical features. What do you think of the values?
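As an illustration of what such a range check can reveal, the sketch below runs `.describe()` on a made-up frame in which one value is deliberately implausible. The `toy` frame and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the 400.0 is a deliberately suspicious entry.
toy = pd.DataFrame({
    "group": ["a", "b", "b", "a"],
    "length_mm": [39.1, 46.5, 40.2, 400.0],
    "mass_g": [3750.0, 4200.0, 3900.0, 4100.0],
})

# By default describe() summarises only the numerical columns.
summary = toy.describe()
print(summary)

# The min and max rows are a quick way to spot out-of-range values.
print(summary.loc[["min", "max"]])
```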
%% Cell type:code id: tags:
``` python
#your code:
```
%% Cell type:raw id: tags:
What do you think of the values:
%% Cell type:markdown id: tags:
## 3. Missing values
%% Cell type:markdown id: tags:
Before we continue, we should look into the missing values in the dataset (NaN) and decide what to do with them.
### Task 3.1
Get an overview of the amount of missing data. You can either print the number of missing values per column or plot them. You can use `.isna()` or `.isnull()` to flag all values that are NaN. To plot the missing values you can use the seaborn function `sns.heatmap`.
What do you think of the results?
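A minimal sketch of the counting approach on made-up data (the `toy` frame is hypothetical): `.isna()` returns a boolean frame, and summing it counts the `True` values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries in both columns.
toy = pd.DataFrame({
    "group": ["a", None, "b", "a"],
    "length_mm": [39.1, np.nan, 46.5, np.nan],
})

# Boolean frame: True where a value is missing.
missing_mask = toy.isna()

# Summing booleans counts the True values, giving missing values per column.
missing_per_column = missing_mask.sum()
print(missing_per_column)
```

`.isnull()` is an alias of `.isna()`, so either name gives the same mask.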
%% Cell type:code id: tags:
``` python
#your code here:
```
%% Cell type:raw id: tags:
What do you think of the results:
%% Cell type:markdown id: tags:
There are multiple strategies for dealing with missing data. For example, you could replace (impute) a missing value with the mean of its column: if the body mass of a particular penguin is missing, you could replace the NaN with the mean recorded body mass of all penguins.
We could also choose to first drop the rows which have multiple missing values, and subsequently impute the remaining missing values.
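Mean imputation as described above could look like the sketch below. The `toy` frame and the column name `mass_g` are made up for illustration; the imputed values are written to a new column so the original stays visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing body mass.
toy = pd.DataFrame({"mass_g": [3750.0, np.nan, 4250.0, 4000.0]})

# mean() skips NaN by default, so this is the mean of the recorded values.
mean_mass = toy["mass_g"].mean()

# fillna() replaces every NaN with the given value and returns a new Series.
toy["mass_g_imputed"] = toy["mass_g"].fillna(mean_mass)
print(toy)
```

Keep in mind that imputing with the mean leaves the column mean unchanged but reduces its variance.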
### Task 3.2
To keep things simple, we will just drop all rows with NaN values. Write code to drop those rows in your dataframe. You can use the function `.dropna` for that. Show the first 10 rows of the data once the NaN values have been dropped and check the shape.
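The behaviour of `.dropna()` is sketched below on made-up data (the `toy` frame is hypothetical). By default it returns a new frame and removes every row that contains at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 2 each contain one missing value.
toy = pd.DataFrame({
    "group": ["a", "b", None, "a"],
    "length_mm": [39.1, np.nan, 46.5, 40.2],
})

# dropna() returns a new frame; the original is left untouched.
cleaned = toy.dropna()
print(toy.shape, cleaned.shape)
print(cleaned)
```

The surviving rows keep their original index labels, which is why the index of `cleaned` has gaps.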
%% Cell type:code id: tags:
``` python
#your code:
```
%% Cell type:markdown id: tags:
### Task 3.3
Next, we need to check for duplicates in the data and, if necessary, remove them. You can use the pandas method `.duplicated()` to check for duplicates. How many duplicates do we have?
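A small sketch of how `.duplicated()` marks rows, using made-up data (the `toy` frame is hypothetical). A row counts as a duplicate only if all of its values match an earlier row.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0 exactly.
toy = pd.DataFrame({
    "group": ["a", "b", "a", "a"],
    "length_mm": [39.1, 46.5, 39.1, 40.2],
})

# True for every row that repeats an earlier row; first occurrences stay False.
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())
print(n_duplicates)

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates()
print(deduplicated)
```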
%% Cell type:code id: tags:
``` python
#check if there are any duplicate entries in your data:
#help(pd.DataFrame.duplicated) #uncomment if you want to use the help function
#your code:
```
%% Cell type:markdown id: tags:
## 4. Analysing the data visually
Data cleaning and Exploratory Data Analysis go hand in hand, and are both an important part of understanding the data. In this part of the notebook we will look into the individual features in more detail.
### Number of species
There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica. Imagine we want to predict the species based on the features in the other columns. The species column will thus be our label. Let's look into the quality of our label.
### Task 4.1
We can analyse the number of species in various ways. We could count how often each unique value occurs in the species column, or we could visualize the label. Try both ways. You can use the pandas method `.value_counts()` and the seaborn function `sns.countplot`.
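The counting approach is sketched below on a made-up label column (the `toy` frame and the column name `label` are hypothetical). Passing `normalize=True` gives shares instead of counts, which makes class imbalance easier to judge.

```python
import pandas as pd

# Hypothetical toy label column with three classes.
toy = pd.DataFrame({"label": ["x", "y", "x", "z", "x", "y"]})

# Absolute count per unique value, sorted from most to least frequent.
counts = toy["label"].value_counts()
print(counts)

# Relative frequency per unique value.
shares = toy["label"].value_counts(normalize=True)
print(shares)
```

For the visual counterpart, `sns.countplot(data=toy, x="label")` would draw the same counts as bars.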
%% Cell type:code id: tags:
``` python
#your code .value_counts
```
%% Cell type:code id: tags:
``` python
#your code .countplot
```
%% Cell type:markdown id: tags:
What do you think of the distribution of the label?
%% Cell type:raw id: tags:
your answer:
%% Cell type:markdown id: tags:
### Numerical features
We will use some visualisation tools to see how our numerical features are distributed.
### Task 4.2
Create a boxplot of all numerical features. It might be insightful to combine features within one boxplot. You can use the seaborn `sns.boxplot()` function for that.
%% Cell type:code id: tags:
``` python
#Your Code:
```
%% Cell type:markdown id: tags:
What do you think of the boxplots:
%% Cell type:raw id: tags:
your answer:
%% Cell type:markdown id: tags:
### Task 4.3
It can also help to look at the distribution of a feature with respect to the label. Use `sns.histplot` for each numerical feature to see the distribution. Add the label information to the plot.
%% Cell type:code id: tags:
``` python
# your code:
```
%% Cell type:markdown id: tags:
What do you notice when looking at the distribution plots with regard to the species?
%% Cell type:raw id: tags:
your answer:
%% Cell type:markdown id: tags:
#### Correlation of (numerical) features:
To get more insight into the data, we will also look into the correlation between the features. A scatter plot is a natural way to visualize this correlation. A nice tool that combines the visualization of the distribution and the correlation of numerical features is the seaborn function `sns.pairplot`.
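As a numerical complement to the visual check, the sketch below computes Pearson correlations with the pandas method `.corr()`. The `toy` frame is made up so that the correlations are exactly +1 and -1; real data would rarely look like this.

```python
import pandas as pd

# Hypothetical toy data with perfectly linear relations between columns.
toy = pd.DataFrame({
    "label": ["x", "x", "y", "y"],
    "length_mm": [1.0, 2.0, 3.0, 4.0],
    "mass_g": [2.0, 4.0, 6.0, 8.0],
    "depth_mm": [4.0, 3.0, 2.0, 1.0],
})

# Keep only the numerical columns, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()
print(corr)
```

A call such as `sns.pairplot(toy, hue="label")` would show the same relations as scatter plots, coloured by label.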
### Task 4.4
Create a pairplot and make sure you add the information about the label:
%% Cell type:code id: tags:
``` python
#your code:
```
%% Cell type:markdown id: tags:
What information can you retrieve from the pairplot? What do you think of the correlation between the features?
%% Cell type:raw id: tags:
Your answer:
%% Cell type:markdown id: tags:
This is the end of this practicum. We have only looked at the numerical data for now. Feel free to also play around with the categorical data. You can also use this data set to try different imputation methods.