You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

3.2 KiB

W2D02 Piscine AI - Data Science

Classification with Scikit Learn

The goal of this day is to understand practical classification.

Today we will learn a different approach in Machine Learning: the classification which is a large domain in the field of statistics and machine learning. Generally, it can be broken down in two areas:

  • Binary classification, where we wish to group an outcome into one of two groups.
  • Multi-class classification, where we wish to group an outcome into one of multiple (more than two) groups.

You may wonder why the approach is different from regression and why we don't use regression and define a threshold from where the class would 1 else 0 - in binary classification. The main reason is that the linear regression is sensitive to outliers, hence the treshold would vary depending on the outliers in the data. The article mentioned explains this reason with plots. To keep things simple, we can say that the output needed in classification is a probability to belong to one of the classes. So, by definition the value output by the classification model has to be between 0 and 1. The linear regression can't satisfy this constraint.

In mathematics, there are functions with nice properties that take as input a real (-inf, inf) and output a value between 0 and 1, the most popular of them is the sigmoid - which is the inverse function of the logit, hence the name logistic regression.

Let's take a small example to have a better understanding of the steps needed to perform a logistic regression on a binary data. Let's assume that we want to predict the gender given the people' size (height).

Logistic regression steps:

  • Fit a sigmoid on the training data
  • Compute sigmoid(size)=0.7 because the sigmoid returns values between 0 and 1
  • Return the class: 0.7 > 0.5 => class 1. Thus, the gender is male

For the linear regression exercises, the loss (Mean Square Error - MSE) is minimized with an algorithm called gradient descent. In the classification, the loss MSE can't be used because the output of the model is 0 or 1 (for binary classification).

The logloss or cross entropy is the loss used for classification. Similarly, it has some nice mathematical properties. The minimization of the logloss is not covered in the exercises. However, since it is used in most machine learning models for classification, I recommend to spend some time reading the related article.

Exercises of the day

  • Exercise 1 Logistic regression with Scikit-learn
  • Exercise 2 Sigmoid
  • Exercise 3 Decision boundary
  • Exercise 4 Train test split
  • Exercise 5 Breast Cancer prediction
  • Bonus: Exercise 6 Multi-class - Optional

Virtual Environment

  • Python 3.x
  • NumPy
  • Pandas
  • Matplotlib
  • Scikit Learn
  • Jupyter or JupyterLab

Version of Scikit Learn I used to do the exercises: 0.22. I suggest to use the most recent one. Scikit Learn 1.0 is finally available after ... 14 years.

Ressources

Logistic regression

Logloss