
feat(nlp-scraper): add information on how to calculate scandal

nprimo · 4 months ago · commit cb10cc6486 · pull/2419/head
1 changed file, 29 changed lines: subjects/ai/nlp-scraper/README.md
@@ -91,19 +91,22 @@ The goal is to detect environmental disaster for the detected companies. Here
 is the methodology that should be used:
-- Define keywords that correspond to environmental disaster that may be caused
-by companies: pollution, deforestation etc ... Here is an example of disaster
-we want to detect: https://en.wikipedia.org/wiki/MV_Erika. Pay attention to
-not use ambiguous words that make sense in the context of an environmental
-disaster but also in another context. This would lead to detect a false
-positive natural disaster.
-- Compute the embeddings of the keywords.
-- Compute the distance between the embeddings of the keywords and all sentences
-that contain an entity. Explain in the `README.md` the embeddings chosen and
-why. Similarly explain the distance or similarity chosen and why.
-- Save the distance
+- Define keywords that correspond to environmental disasters that may be caused
+by companies: pollution, deforestation, etc. Here is [an example of a
+disaster we want to detect](https://en.wikipedia.org/wiki/MV_Erika). Be careful
+not to use ambiguous words that make sense both in an environmental-disaster
+context and in another context; these would produce false positives.
+- Compute the [embeddings of the
+keywords](https://en.wikipedia.org/wiki/Word_embedding#Software).
+- Compute the distance ([here are some
+examples](https://www.nltk.org/api/nltk.metrics.distance.html#module-nltk.metrics.distance))
+between the embeddings of the keywords and all sentences that contain an
+entity. Explain in the `README.md` the embeddings chosen and why. Similarly,
+explain the distance or similarity chosen and why.
+- Save a metric that unifies all the distances calculated per article.
+- Flag the top 10 articles.
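
The commit stops at the methodology, so here is a minimal sketch of the embedding-and-similarity step it describes, assuming spaCy with a model that ships word vectors (`en_core_web_md`); the keyword list, example sentences, and helper name are illustrative, not part of the commit:

```python
# Score each entity-bearing sentence against the disaster keywords.
# Assumes spaCy's en_core_web_md model, which includes word vectors;
# Doc.similarity computes cosine similarity of the averaged vectors.
import spacy

nlp = spacy.load("en_core_web_md")

# Illustrative keywords; the subject asks you to choose unambiguous ones.
KEYWORDS = ["oil spill", "pollution", "deforestation", "toxic waste"]
keyword_docs = [nlp(kw) for kw in KEYWORDS]

def score_sentence(sentence: str) -> float:
    """Return the best cosine similarity between the sentence and any keyword."""
    sent_doc = nlp(sentence)
    if not sent_doc.vector_norm:  # no vector available (e.g. all-OOV tokens)
        return 0.0
    return max(sent_doc.similarity(kw) for kw in keyword_docs)

# Example: sentences previously kept because they contain a company entity.
sentences = [
    "Total chartered the tanker that caused a massive oil spill.",
    "The company reported record quarterly earnings.",
]
for s in sentences:
    print(f"{score_sentence(s):.3f}  {s}")
```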
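And one possible way to unify the per-sentence distances into a single per-article metric and flag the top 10. Taking the maximum sentence score is just one defensible choice (the README asks you to justify yours), and the `articles` layout is an assumption:

```python
# Assumes score_sentence from the sketch above and articles stored as
# {article_id: [sentence, ...]}; both names are illustrative.
def article_metric(sentence_scores: list[float]) -> float:
    """One possible unifying metric: the best sentence score in the article."""
    return max(sentence_scores, default=0.0)

def flag_top_articles(articles: dict[str, list[str]], k: int = 10) -> list[tuple[str, float]]:
    """Score every article, then return the k highest-scoring ones."""
    scored = {
        article_id: article_metric([score_sentence(s) for s in sentences])
        for article_id, sentences in articles.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)[:k]
```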
