W2D05 Piscine AI - Data Science

Model selection methodology

If you finished yesterday's exercises, you should be able to train several Machine Learning algorithms and to choose the one returned by GridSearchCV. GridSearchCV returns the model that gives the best score on the test set. Yesterday, as I told you, I changed the cv parameter to compute the GridSearch with a single train set and a single test set.

It means that the selected model is based on one single measure. What if, by luck, we predicted correctly on that section? What if the best model is bad? What if I could have selected a better model?

We will answer these questions today! The topics we will cover are among the most important in Machine Learning.
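To make the "one single measure" problem concrete, here is a minimal sketch. It is not part of the exercises: the synthetic data set, the LogisticRegression model and the seeds are assumptions chosen only for illustration. It scores the same model on several different single train/test splits, then on 5 folds with cross_val_score.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data, used only for illustration (not the exercise data set).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Illustrative model choice; any scikit-learn classifier would do.
model = LogisticRegression(max_iter=1000)

# One single measure per split: the score depends on which rows
# happen to land in the test set.
single_scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    single_scores.append(model.fit(X_train, y_train).score(X_test, y_test))

# Several measures: 5-fold cross-validation scores the model
# on 5 different held-out folds.
cv_scores = cross_val_score(model, X, y, cv=5)

print("single-split accuracies:", [round(s, 3) for s in single_scores])
print("5-fold accuracies:", cv_scores.round(3))
print("5-fold mean:", round(cv_scores.mean(), 3))
```

If the single-split accuracies differ noticeably from one seed to the next, that spread is the "luck" mentioned above. Averaging over folds gives a more stable estimate than any one of them.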

Exercises of the day

  • Exercise 0 Environment and libraries
  • Exercise 1 K-Fold
  • Exercise 2 Cross validation (k-fold)
  • Exercise 3 GridSearchCV
  • Exercise 4 Validation curve and Learning curve

Virtual Environment

  • Python 3.x
  • NumPy
  • Pandas
  • Jupyter or JupyterLab
  • Scikit-learn
  • Matplotlib

Version of Pandas I used to do the exercises: 1.0.1. I suggest using the most recent one.

Resources

Must read before starting the exercises

Bias-Variance trade-off, aka Underfitting/Overfitting:

Cross-validation

Exercise 0 Environment and libraries

The goal of this exercise is to set up the Python work environment with the required libraries.

Note: For each quest, your first exercise will be to set up the virtual environment with the required libraries.

I recommend using:

  • the latest stable version of Python.
  • the virtual environment you're the most comfortable with. virtualenv and conda are the most used in Data Science.
  • one of the most recent versions of the required libraries.
  1. Create a virtual environment named ex00, with a version of Python >= 3.8, with the following libraries: pandas, numpy, jupyter, matplotlib and scikit-learn.
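
Once the environment is created and activated, a quick way to check it is to list the installed versions from inside Python. The sketch below is only a suggestion and is not required by the exercise. The helper name installed_versions is hypothetical, and the package names are assumed to be the PyPI distribution names of the libraries listed above. It relies only on the standard library (importlib.metadata, available from Python 3.8).

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names for the libraries required by the exercise.
REQUIRED = ["pandas", "numpy", "jupyter", "matplotlib", "scikit-learn"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(REQUIRED)
print("Python", sys.version.split()[0])
for name, ver in report.items():
    print(f"{name}: {ver if ver else 'NOT INSTALLED'}")
```

Run it with the interpreter of the ex00 environment. Any line showing NOT INSTALLED means the library is missing from that environment.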