W2D01 Piscine AI - Data Science

Machine Learning Pipeline

Today we focus on data preprocessing and discover the Pipeline object from scikit-learn. The preprocessing steps are:

1. Manage categorical variables with Integer encoding and One Hot Encoding
2. Impute the missing values
3. Reduce the dimension of the data
4. Scale the data

Step 1 is always necessary when the data contains categorical variables. Models operate on numbers, so string data, for instance, cannot be processed raw.
Step 2 is always necessary when the data contains missing values. Machine learning models operate on numbers and a missing value has no mathematical representation, which is why missing values have to be imputed.
Step 3 is required when the dimension of the data set is high. Dimension reduction algorithms reduce the dimensionality of the data either by selecting the variables that contain most of the information (SelectKBest) or by transforming the data. Depending on the signal in the data and the size of the data set, dimension reduction is not always required. This step is not covered today because of its complexity, and understanding the theory behind it is important. However, I suggest giving it a try during the projects.
Step 4 is required for some types of Machine Learning algorithms, mostly KNN (K-Nearest Neighbors), Neural Networks, Linear Regression, and Logistic Regression. Some algorithms work better with feature scaling because minimizing the loss function may be more difficult when the features have very different ranges.
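As a rough illustration of steps 1, 2, and 4 taken one at a time, here is a minimal sketch on a made-up data set. The column names and values are hypothetical, and the choice of the mean strategy and of StandardScaler is an assumption made for the example, not a recommendation from the exercises.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy data: a nominal column, an ordered column,
# and a numeric column with one missing value.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": ["small", "large", "medium", "small"],
    "weight": [1.0, np.nan, 3.0, 5.0],
})

# Step 1a: integer (ordinal) encoding, with the category order given explicitly.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_encoded = ordinal.fit_transform(df[["size"]])

# Step 1b: one hot encoding. The default output is sparse,
# so .toarray() converts it to a dense array for display.
one_hot = OneHotEncoder()
color_encoded = one_hot.fit_transform(df[["color"]]).toarray()

# Step 2: replace the missing weight with the mean of the column.
imputer = SimpleImputer(strategy="mean")
weight_imputed = imputer.fit_transform(df[["weight"]])

# Step 4: standardize to zero mean and unit variance.
scaler = StandardScaler()
weight_scaled = scaler.fit_transform(weight_imputed)

print(size_encoded.ravel())
print(color_encoded)
print(weight_imputed.ravel())
print(weight_scaled.ravel())
```

Each transformer follows the same fit / transform convention, which is what makes it possible to chain them.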

These steps are sequential: the output of step 1 is used as input for step 2, and so on, and the output of step 4 is used as input for the Machine Learning model. To chain them, scikit-learn provides an object: Pipeline.

As we know, the model evaluation methodology requires splitting the data set into a train set and a test set. The preprocessing is learned/fitted on the train set only and then applied to the test set.
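The fit-on-train, apply-on-test rule could look like the sketch below. The data is hypothetical and StandardScaler simply stands in for any preprocessing transform.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 10 samples, 1 feature, binary target.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scaler = StandardScaler()
scaler.fit(X_train)  # mean and standard deviation are learned on the train set only

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # the train statistics are reused here

print(scaler.mean_, scaler.scale_)
```

Calling fit on the test set, or on the full data set before splitting, would leak information from the test set into the preprocessing.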

The Pipeline object takes as input the preprocessing transforms and a Machine Learning model. It can then be called the same way a Machine Learning model is called. This is pretty practical because we no longer need to carry many objects around.
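A minimal sketch of such a Pipeline, assuming purely numeric features: the synthetic data, the step names, and the choice of a median imputer with logistic regression are all illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two numeric features, a binary target,
# and some missing values injected in the first feature.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[::10, 0] = np.nan  # every tenth value of the first feature is missing

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each step is a (name, object) pair; the last step is the model.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() fits every transform on the train set, then fits the model.
pipe.fit(X_train, y_train)

# predict() and score() apply the fitted transforms before calling the model.
predictions = pipe.predict(X_test)
accuracy = pipe.score(X_test, y_test)
print(accuracy)
```

Only one object has to be fitted, stored, and passed around, and the train/test separation of the preprocessing is handled by the Pipeline itself.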

Exercises of the day

Exercise 1 Imputer 1
Exercise 2 Scaler
Exercise 3 One hot Encoder
Exercise 4 Ordinal Encoder
Exercise 5 Categorical variables
Exercise 6 Pipeline

Virtual Environment

Python 3.x
NumPy
Pandas
Matplotlib
Scikit Learn
Jupyter or JupyterLab

Version of scikit-learn I used to do the exercises: 0.22. I suggest using the most recent one. scikit-learn 1.0 is finally available after ... 14 years.
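To see which version is installed in the virtual environment, a quick check along these lines should work:

```python
import sklearn

# Prints the installed scikit-learn version as a string.
installed_version = sklearn.__version__
print(installed_version)
```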

Resources
