(Week 5) [T8] NLP Flashcards

Question 1

Q

Briefly describe the classical NLP approach.

Answer

A

It can be seen as a pipeline that starts from text documents.
Preprocessing → Section boundary detection → Sentence boundary detection → Tokenizer
After the tokenizer, further processing steps can follow.
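
The four stages above can be sketched with regular expressions only. This is a minimal illustration, not a production pipeline: the function names, the blank-line rule for section boundaries, and the punctuation rule for sentence boundaries are all assumptions made for the example.

```python
import re

def preprocess(text: str) -> str:
    # Normalize line endings and collapse runs of spaces/tabs.
    text = text.replace("\r\n", "\n")
    return re.sub(r"[ \t]+", " ", text).strip()

def split_sections(text: str) -> list[str]:
    # Assumption: sections are separated by at least one blank line.
    return [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]

def split_sentences(section: str) -> list[str]:
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", section) if s.strip()]

def tokenize(sentence: str) -> list[str]:
    # Words (allowing internal hyphens/apostrophes) or single punctuation marks.
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)

def run_pipeline(document: str) -> list[list[list[str]]]:
    # Result is nested as sections -> sentences -> tokens.
    cleaned = preprocess(document)
    return [
        [tokenize(sentence) for sentence in split_sentences(section)]
        for section in split_sections(cleaned)
    ]

doc = "Intro:\nNLP is fun.  It is hard!\n\nMethods:\nWe use rule-based tools."
result = run_pipeline(doc)
for section in result:
    print(section)
```

Each stage only consumes the output of the previous one, which is why an error early in the pipeline (for example, a wrongly placed sentence boundary) carries through to every later step.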

Question 2

Q

Natural language contains ambiguity and linguistic variation. Give 4 examples.

Answer

A

Different words with the same meaning (synonymy).
General polysemy (one word with several meanings).
Acronyms and domain-specific language.
Negation.

Question 3

Q

[1st method - Simple rule-based approach]
Pros?
Cons?

Answer

A

+ Simple and still a common approach.
+ Regular expressions are built into many programming languages.
+ Great for semi-structured targets.

- Patterns must consider all possible configurations.
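
A small sketch of the trade-off, assuming a hypothetical rule that extracts dates written as YYYY-MM-DD. The pattern and the sample sentence are invented for illustration.

```python
import re

# Hypothetical rule: match dates written as YYYY-MM-DD only.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

text = "Ordered 2021-03-12, shipped 15/03/2021, delivered on March 20, 2021."

# The rule finds the semi-structured ISO date...
found = [match.group(0) for match in ISO_DATE.finditer(text)]
print(found)

# ...but the other two date formats are missed until a pattern is
# written for each of them.
```

The rule is short and works well on the format it targets, but every additional way of writing a date needs its own pattern.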

Question 4

Q

[2nd method - Symbolic/grammatical NLP (classical approach)]
Pros?
Cons?

Answer

A

+ Great for extracting and mapping large numbers of concepts.

- Complex: more steps mean more opportunities for error.
- Can be slow.

Question 5

Q

[3rd method - Machine learning]
Pros?
Cons?

Answer

A

+ Targeted approach = high accuracy.
+ Capable of learning from examples.

- Involves manual training.
- New target = new training effort.
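
To make "learning from examples" concrete, here is a toy Naive Bayes text classifier in plain Python. The choice of Naive Bayes, the sentiment labels, and the four training sentences are assumptions for illustration only; the cards do not name a specific algorithm.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(examples):
    # examples: (text, label) pairs that someone labeled by hand.
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab

def predict(model, text):
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Log prior plus log likelihoods with add-one smoothing.
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical hand-labeled training data.
training = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("boring plot awful acting", "neg"),
]
model = train(training)
print(predict(model, "great acting"))
print(predict(model, "awful boring movie"))
```

The model only knows the labels it was trained on, so a new target (say, topic instead of sentiment) would need a new set of labeled examples.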

Question 6

Q

Calculate the TF-IDF (term frequency - inverse document frequency) value of the following.

The term t appears 20 times in a document that contains 100 words.
The collection contains 10000 documents, and the term t appears in 100 of them.

Interpret the value.

Answer

A

TF = 20/100 = 0.2
IDF = log_10(10000/100) = 2

TF-IDF = TF * IDF = 0.2 * 2 = 0.4.

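The higher the TF, the more relevant the document is for the term.
The lower the IDF, the less discriminative the term is.
The higher the TF-IDF score, the more important or relevant the term is.
Here t is frequent in the document (TF = 0.2) and appears in only 1% of the collection (IDF = 2), so it is a fairly distinctive term for this document.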
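
The calculation can be checked with a short script. It assumes the same definitions as the card: TF as a relative frequency and IDF with a base-10 logarithm (other sources use the natural log or smoothing). The function and parameter names are made up for this sketch.

```python
import math

def tf(term_count, doc_length):
    # Relative frequency of the term within one document.
    return term_count / doc_length

def idf(num_docs, docs_with_term):
    # Base-10 log, matching the card; other variants exist.
    return math.log10(num_docs / docs_with_term)

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    return tf(term_count, doc_length) * idf(num_docs, docs_with_term)

score = tf_idf(term_count=20, doc_length=100, num_docs=10000, docs_with_term=100)
print(round(score, 4))
```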

(6 cards)