
feat(nlp-scraper): add information on how to calculate scandal

nprimo · 4 months ago · commit cb10cc6486 · pull/2419/head
1 changed file, 29 changed lines: subjects/ai/nlp-scraper/README.md
@@ -91,19 +91,22 @@ The goal is to detect environmental disaster for the detected companies. Here
 is the methodology that should be used:
-- Define keywords that correspond to environmental disaster that may be caused
-by companies: pollution, deforestation etc ... Here is an example of disaster
-we want to detect: https://en.wikipedia.org/wiki/MV_Erika. Pay attention to
-not use ambiguous words that make sense in the context of an environmental
-disaster but also in another context. This would lead to detect a false
-positive natural disaster.
-- Compute the embeddings of the keywords.
-- Compute the distance between the embeddings of the keywords and all sentences
-that contain an entity. Explain in the `README.md` the embeddings chosen and
-why. Similarly explain the distance or similarity chosen and why.
-- Save the distance
+- Define keywords that correspond to environmental disasters that may be caused
+by companies: pollution, deforestation, etc. Here is [an example of a
+disaster we want to detect](https://en.wikipedia.org/wiki/MV_Erika). Be careful
+not to use ambiguous words that make sense both in an environmental-disaster
+context and in another context; these would produce false positives.
+- Compute the [embeddings of the
+keywords](https://en.wikipedia.org/wiki/Word_embedding#Software).
+- Compute the distance ([here are some
+examples](https://www.nltk.org/api/nltk.metrics.distance.html#module-nltk.metrics.distance))
+between the embeddings of the keywords and all sentences that contain an
+entity. Explain in the `README.md` the embeddings chosen and why. Similarly,
+explain the distance or similarity chosen and why.
+- Save a metric that unifies all the distances calculated per article.
+- Flag the top 10 articles.
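
The commit stops at the methodology, so here is a minimal sketch of the embedding-and-similarity step it describes, assuming spaCy with a model that ships word vectors (`en_core_web_md`); the keyword list, example sentences, and helper name are illustrative, not part of the commit:

```python
# Score each entity-bearing sentence against the disaster keywords.
# Assumes spaCy's en_core_web_md model, which includes word vectors;
# Doc.similarity computes cosine similarity of the averaged vectors.
import spacy

nlp = spacy.load("en_core_web_md")

# Illustrative keywords; the subject asks you to choose unambiguous ones.
KEYWORDS = ["oil spill", "pollution", "deforestation", "toxic waste"]
keyword_docs = [nlp(kw) for kw in KEYWORDS]

def score_sentence(sentence: str) -> float:
    """Return the best cosine similarity between the sentence and any keyword."""
    sent_doc = nlp(sentence)
    if not sent_doc.vector_norm:  # no vector available (e.g. all-OOV tokens)
        return 0.0
    return max(sent_doc.similarity(kw) for kw in keyword_docs)

# Example: sentences previously kept because they contain a company entity.
sentences = [
    "Total chartered the tanker that caused a massive oil spill.",
    "The company reported record quarterly earnings.",
]
for s in sentences:
    print(f"{score_sentence(s):.3f}  {s}")
```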
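And one possible way to unify the per-sentence distances into a single per-article metric and flag the top 10. Taking the maximum sentence score is just one defensible choice (the README asks you to justify yours), and the `articles` layout is an assumption:

```python
# Assumes score_sentence from the sketch above and articles stored as
# {article_id: [sentence, ...]}; both names are illustrative.
def article_metric(sentence_scores: list[float]) -> float:
    """One possible unifying metric: the best sentence score in the article."""
    return max(sentence_scores, default=0.0)

def flag_top_articles(articles: dict[str, list[str]], k: int = 10) -> list[tuple[str, float]]:
    """Score every article, then return the k highest-scoring ones."""
    scored = {
        article_id: article_metric([score_sentence(s) for s in sentences])
        for article_id, sentences in articles.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)[:k]
```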
