You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
brad-gh bc5829a8e0 remove useless mention to another ex 2 years ago
..
ex00 Renaming with uppercase of readme files to respect standard 2 years ago
ex01 Renaming with uppercase of readme files to respect standard 2 years ago
ex02 remove useless mention to another ex 2 years ago
ex03 Renaming with uppercase of readme files to respect standard 2 years ago
ex04 Renaming with uppercase of readme files to respect standard 2 years ago
ex05 Renaming with uppercase of readme files to respect standard 2 years ago
ex06 Renaming with uppercase of readme files to respect standard 2 years ago
ex07 Renaming with uppercase of readme files to respect standard 2 years ago
README.md Renaming with uppercase of readme files to respect standard 2 years ago

README.md

W3D04 Piscine AI - Data Science

Natural Language processing

“NLP makes it possible for humans to talk to machines:” This branch of AI enables computers to understand, interpret, and manipulate human language. This technology is one of the most broadly applied areas of machine learning and is critical in effectively analyzing massive quantities of unstructured, text-heavy data.

Machine learning algorithms cannot work with raw text directly. Rather, the text must be converted into vectors of numbers. In natural language processing, a common technique for extracting features from text is to place all of the words that occur in the text in an unordered bucket. This aproach is called a bag of words model or BoW for short. It’s referred to as a “bag” of words because any information about the structure of the sentence is lost. This is useful to train usual machine learning models on text data. Other types of models as RNNs or LSTMs take as input a complete and ordered sequence.

Almost every Natural Language Processing (NLP) task requires text to be preprocessed before training a model. The article Your Guide to Natural Language Processing (NLP) gives a very good introduction to NLP.

Today, we we will learn to preprocess text data and to create a bag of word representation. Les packages NLTK and Spacy to do the preprocessing

Exercises of the day

  • Exercise 1 Lower case
  • Exercise 2 Punctuation
  • Exercise 3 Tokenization
  • Exercise 4 Stop words
  • Exercise 5 Stemming
  • Exercise 6 Text preprocessing
  • Exercise 7 Bag of Word representation

Virtual Environment

  • Python 3.x
  • Jupyter or JupyterLab
  • Scikit Learn
  • NLTK

I suggest to use the most recent libraries.

Ressources