(Week 5) [T8] NLP Flashcards
Talk a little bit about the classical NLP approach.
It can be seen as a pipeline that starts with text documents.
Preprocessing → Section boundary detection → Sentence boundary detection → Tokenizer
After the tokenizer, further processing steps may follow.
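The pipeline stages above can be sketched roughly as follows. This is only a minimal illustration using regular expressions: the function names, the "blank line = section boundary" rule, and the "., ! or ? followed by whitespace = sentence boundary" rule are assumptions made for the example, not part of the course material. Real systems use far more robust detectors.

```python
import re


def preprocess(text: str) -> str:
    # Normalize Windows line endings and collapse runs of spaces/tabs.
    text = text.replace("\r\n", "\n")
    return re.sub(r"[ \t]+", " ", text).strip()


def split_sections(text: str) -> list[str]:
    # Assumption for this sketch: a blank line marks a section boundary.
    return [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]


def split_sentences(section: str) -> list[str]:
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", section) if s.strip()]


def tokenize(sentence: str) -> list[str]:
    # A token is a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)


def run_pipeline(document: str) -> list[list[list[str]]]:
    # Result shape: sections -> sentences -> tokens.
    return [
        [tokenize(sentence) for sentence in split_sentences(section)]
        for section in split_sections(preprocess(document))
    ]
```

Calling run_pipeline on a document returns one list per section, each holding one token list per sentence. Later steps would consume these token lists.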
NLP has to deal with ambiguity and linguistic variation. Give 4 examples.
- Different words with the same meaning (synonymy).
- General polysemy (one word with several meanings).
- Acronyms and domain-specific language.
- Negation.
[1st method - Simple rule-based approach]
Pros?
Cons?
+ Simple and still a common approach.
+ Regular expressions are included in many programming languages.
+ Great for semi-structured targets.
- Patterns must anticipate every possible configuration of the target.
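A small sketch of the rule-based approach, assuming a hypothetical semi-structured target (blood pressure readings such as "BP: 120/80"). The pattern and the function name are invented for illustration. The example also shows the main con: any phrasing the pattern did not anticipate is silently missed.

```python
import re

# Hypothetical target: readings written like "BP: 120/80" or
# "blood pressure of 135 / 85". Only these configurations are covered.
BP_PATTERN = re.compile(
    r"\b(?:BP|blood pressure)\s*(?::|is|of)?\s*(\d{2,3})\s*/\s*(\d{2,3})\b",
    re.IGNORECASE,
)


def extract_blood_pressure(text: str) -> list[tuple[int, int]]:
    # Return every (systolic, diastolic) pair the pattern can find.
    return [(int(sys), int(dia)) for sys, dia in BP_PATTERN.findall(text)]


note = "BP: 120/80 on admission, later blood pressure of 135 / 85."
print(extract_blood_pressure(note))

# A configuration the pattern does not cover yields nothing.
print(extract_blood_pressure("Pressure 120 over 80"))
```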
[2nd method - Symbolic/grammatical NLP (classical approach)]
Pros?
Cons?
+ Great for extracting and mapping large numbers of concepts.
- Complex: more steps mean more opportunities for error.
- Can be slow.
[3rd method - Machine learning]
Pros?
Cons?
+ Targeted approach = high accuracy.
+ Capable of learning from examples.
- Involves manual training.
- New target = new training effort.
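To make "learning from examples" concrete, here is a minimal sketch of a text classifier. The notes do not name a specific algorithm; multinomial Naive Bayes with add-one smoothing is used here purely as an illustrative choice, and the labels ("negated" / "affirmed") and the training sentences are made up. The hand-labeled examples are the manual effort, and a new target would need a new labeled set.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total_examples)  # log prior
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in text.lower().split():
            if token in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labeled training set.
training_examples = [
    ("patient denies chest pain", "negated"),
    ("no evidence of pneumonia", "negated"),
    ("patient reports chest pain", "affirmed"),
    ("findings consistent with pneumonia", "affirmed"),
]
model = train_naive_bayes(training_examples)
print(predict(model, "patient denies fever"))
```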
Calculate the TF-IDF (term frequency - inverse document frequency) value of the following.
The term t appears 20 times in a document that contains 100 words.
The collection contains 10000 documents, and the term t appears in 100 of them.
Interpret the value.
TF = 20/100 = 0.2
IDF = log_10(10000/100) = 2
TF-IDF = TF * IDF = 0.2 * 2 = 0.4.
The higher the TF, the more relevant the document is for that term.
The lower the IDF, the less discriminative the term is.
The higher the TF-IDF score, the more important or relevant the term is.
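The calculation above can be checked with a few lines of code. This sketch follows the formulas as used in this card (TF as raw count divided by document length, IDF as a base-10 logarithm); the function names are chosen for the example, and other TF-IDF variants exist.

```python
import math


def term_frequency(term_count: int, document_length: int) -> float:
    # Share of the document's words that are the term.
    return term_count / document_length


def inverse_document_frequency(total_documents: int, documents_with_term: int) -> float:
    # Base-10 log of how rare the term is across the collection.
    return math.log10(total_documents / documents_with_term)


def tf_idf(term_count, document_length, total_documents, documents_with_term):
    return term_frequency(term_count, document_length) * inverse_document_frequency(
        total_documents, documents_with_term
    )


# Values from the card: 20 occurrences in 100 words, 100 of 10000 documents.
score = tf_idf(20, 100, 10000, 100)
print(round(score, 3))  # approximately 0.4
```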