
CON-2324 nlp fix the exercise 7 (#2343)

* docs(nlp): fix the exercise 7 and the env

* docs(nlp): fix notes

* docs(nlp): fix resources
MSilva95 5 months ago committed by GitHub
commit 4a69ac7acd
      subjects/ai/nlp/README.md


“NLP makes it possible for humans to talk to machines:” This branch of AI enables computers to understand, interpret, and manipulate human language. This technology is one of the most broadly applied areas of machine learning and is critical in effectively analyzing massive quantities of unstructured, text-heavy data.
Machine learning algorithms cannot work with raw text directly. Rather, the text must be converted into vectors of numbers. In natural language processing, a common technique for extracting features from text is to place all of the words that occur in the text in an unordered bucket. This approach is called a bag of words model, or BoW for short. It’s referred to as a “bag” of words because any information about the structure of the sentence is lost. This is useful for training classical machine learning models on text data. Other types of models, such as RNNs or LSTMs, take as input a complete and ordered sequence.
Almost every Natural Language Processing (NLP) task requires text to be preprocessed before training a model. The article **Your Guide to Natural Language Processing (NLP)** gives a very good introduction to NLP.
Today, we will learn to preprocess text data and to create a bag of words representation, using the NLTK and spaCy packages for the preprocessing.
### Exercises of the day
- Python 3.x
- Jupyter or JupyterLab
- Pandas
- Scikit-learn
- NLTK
- Tabulate
I suggest using the most recent versions of these libraries.
The goal of this exercise is to set up the Python work environment with the required libraries.
> Note: For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required
1. Create a virtual environment, with a version of Python >= `3.8`, with the following libraries: `pandas`, `jupyter`, `nltk` and `scikit-learn`.
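As a minimal sketch with `conda` (the environment name `nlp_env` is a placeholder, not part of the exercise):

```shell
# Create an environment with Python 3.8+ and the required libraries
# (the name "nlp_env" is arbitrary).
conda create -n nlp_env python=3.9 pandas jupyter nltk scikit-learn -y
conda activate nlp_env

# Equivalent with virtualenv + pip:
# python3 -m venv nlp_env && source nlp_env/bin/activate
# pip install pandas jupyter nltk scikit-learn
```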
1. Print all texts in lowercase
2. Print all texts in upper case
> Note: Do not change the text manually!
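Both steps can be done with the vectorized string methods of a pandas Series (the example texts below are placeholders, not the exercise's actual data):

```python
import pandas as pd

# Placeholder texts; the exercise provides its own list_ variable.
list_ = ["This is my first NLP exercise", "Machine Learning is FUN"]
series_data = pd.Series(list_, name='text')

lower = series_data.str.lower()  # vectorized lowercase
upper = series_data.str.upper()  # vectorized uppercase
print(lower[0])  # this is my first nlp exercise
print(upper[1])  # MACHINE LEARNING IS FUN
```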
---
---
# Exercise 2: Punctuation
The goal of this exercise is to learn to deal with punctuation. In Natural Language Processing, some basic approaches, such as the Bag of Words model, treat the text as an unordered combination of words. In that case the punctuation is not always useful, as it doesn't add information to the model. That is why it is removed.
1. Remove the punctuation from this sentence. All characters in !"#$%&'()\*+,-./:;<=>?@[\]^\_`{|}~ are considered punctuation.
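One common way to do this is `str.translate` with `string.punctuation`, which contains exactly the characters listed above (the sentence here is a made-up example):

```python
import string

sentence = "Hello, world! This -- is; an example?"  # made-up sentence
# Build a translation table that maps every punctuation character to None.
table = str.maketrans('', '', string.punctuation)
no_punct = sentence.translate(table)
# Note the double space left where "--" was removed.
print(no_punct)  # Hello world This  is an example
```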
2. Tokenize this text using `word_tokenize` from NLTK.
_Resources: [How to Get Started with NLP – 6 Unique Ways to Perform Tokenization](https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/)_
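NLTK's `word_tokenize` is the tool asked for here; as a rough, dependency-free illustration of what tokenization produces, a naive regex tokenizer (an approximation, not NLTK's algorithm) could look like this:

```python
import re

text = "Bitcoin is a cryptocurrency invented in 2008."  # shortened snippet
# Naive tokenizer: runs of word characters, or single punctuation marks.
# word_tokenize is smarter (contractions, abbreviations, etc.).
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['Bitcoin', 'is', 'a', 'cryptocurrency', 'invented', 'in', '2008', '.']
```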
---
The goal of this exercise is to learn to use stemming with NLTK. As explained in detail in the article, stemming is the process of reducing inflection in words to their root forms, such as mapping a group of words to the same stem even if the stem itself is not a valid word in the language.
> Note: The output of a stemmer is a word that may not exist in the dictionary.
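For instance, with NLTK's `PorterStemmer` (note how `studies` maps to the non-dictionary stem `studi`):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# The stem is not always a valid English word.
for word in ["running", "studies", "maximum"]:
    print(word, "->", stemmer.stem(word))
# running -> run
# studies -> studi
# maximum -> maximum
```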
Put this text in a variable:
```
01 Edu System presents an innovative curriculum in software engineering and programming. With a renowned industry-leading reputation, the curriculum has been rigorously designed for learning skills of the digital world and technology industry. Taking a different approach than the classic teaching methods today, learning is facilitated through a collective and co-creative process in a professional environment.
```
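A minimal sketch of such a cleaning function (lowercasing, punctuation removal, and whitespace tokenization only; the full exercise also involves the NLTK steps from the previous exercises):

```python
import string

def preprocess(text):
    """Toy cleaning pipeline: lowercase, strip punctuation, split on
    whitespace. A fuller version would also remove stop words and stem."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

print(preprocess("01 Edu System presents an innovative curriculum."))
# ['01', 'edu', 'system', 'presents', 'an', 'innovative', 'curriculum']
```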
# Exercise 7: Bag of Words representation
The goal of this exercise is to understand the creation of a Bag of Words (BoW) model for a corpus of texts and to create a labeled dataset from textual data using a word count matrix.
_Resources: [Gentle Introduction to Bag of Words Model](https://machinelearningmastery.com/gentle-introduction-bag-words-model/)_
The Bag of Words representation assumes that word order in a text is irrelevant. There are various types of Bag of Words representations:
- Boolean: Each document represented by a boolean vector
- Word count: Each document represented by a word count vector
- TFIDF: Each document represented by a score vector (more detailed in the next exercise)
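The difference between the first two can be illustrated on a toy corpus (the documents and vocabulary below are invented for illustration):

```python
# Toy tokenized documents and a fixed vocabulary (both invented).
docs = [["boat", "and", "boat"], ["compute"]]
vocab = ["and", "boat", "compute"]

# Word count vectors: one count per vocabulary word.
counts = [[doc.count(w) for w in vocab] for doc in docs]
# Boolean vectors: 1 if the word occurs at all.
boolean = [[int(c > 0) for c in row] for row in counts]

print(counts)   # [[1, 2, 0], [0, 0, 1]]
print(boolean)  # [[1, 1, 0], [0, 0, 1]]
```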
The data file [tweets_train.txt](resources/tweets_train.txt) contains labeled tweets indicating their positivity.
Steps:
1. Preprocess the data using the function implemented earlier. Then, use `CountVectorizer` from scikit-learn with `max_features=500` to compute the word count of the tweets. The output is a sparse matrix.
- Check the shape of the word count matrix.
- Set **max_features** to 500, considering the initial size of the dictionary.
> Note: A data set is often described as an m x n matrix, where m is the number of rows (samples) and n is the number of columns (features). It is strongly recommended to work with m >> n. The value of the ratio depends on the signal existing in the data set and on the model complexity.
2. Using `pd.DataFrame.sparse.from_spmatrix` from Pandas, create a DataFrame with documents in rows and the dictionary in columns.
| | and | boat | compute |
| --: | --: | ---: | ------: |
| 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 |
> Note: The sample 3x3 table mentioned is a small representation of the expected output for demonstration purposes. It's not necessary to drop columns in this context.
3. Create a DataFrame with labels where:
- 1: Positive
- 0: Neutral
- -1: Negative
| | Label |
| --: | ----: |
| 0 | -1 |
| 1 | 0 |
| 2 | 1 |
_Resources: [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)_
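The mapping to the label DataFrame can be sketched as follows (the `sentiments` list is invented; the real values come from the tweets file):

```python
import pandas as pd

# Invented sentiment strings; the real ones are parsed from tweets_train.txt.
sentiments = ["negative", "neutral", "positive"]
mapping = {"positive": 1, "neutral": 0, "negative": -1}

labels = pd.DataFrame({"Label": [mapping[s] for s in sentiments]})
print(labels)
```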
