(Week 5) [T8] NLP Flashcards

1
Q

Briefly describe the classical NLP approach.

A

It can be seen as a pipeline that starts with text documents.
Preprocessing → Section boundary detection → Sentence boundary detection → Tokenizer
After the tokenizer, further processing steps can follow.
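
The first stages of this pipeline can be sketched in a few lines. This is only a minimal illustration under simplifying assumptions, not a production approach: the function names are hypothetical, sections are assumed to be separated by blank lines, and sentences are assumed to end at ".", "!" or "?" followed by whitespace.

```python
import re

def split_sections(document: str) -> list[str]:
    # Assumption: sections are separated by one or more blank lines.
    return [s.strip() for s in re.split(r"\n\s*\n", document) if s.strip()]

def split_sentences(section: str) -> list[str]:
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", section.strip()) if s]

def tokenize(sentence: str) -> list[str]:
    # A token is a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

def pipeline(document: str) -> list[list[list[str]]]:
    # Preprocessing step: normalize Windows line endings.
    text = document.replace("\r\n", "\n")
    return [
        [tokenize(sentence) for sentence in split_sentences(section)]
        for section in split_sections(text)
    ]

# Result is nested as sections -> sentences -> tokens.
result = pipeline("History\n\nPatient has a cough. No fever!\n")
```

Real text breaks these rules quickly (abbreviations such as "Dr." end with a period, for instance), which is one reason each stage is an opportunity for error.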

2
Q

Text in NLP can be ambiguous and show linguistic variation. Give 4 examples.

A
  • Different words with the same meaning (synonymy).
  • General polysemy (one word with several meanings).
  • Acronyms and domain-specific language.
  • Negation.
3
Q

[1st method - Simple rule-based approach]
Pros?
Cons?

A

+ Simple and still a common approach.
+ Regular expressions are built into many programming languages.
+ Great for semi-structured targets.

- Patterns must consider all possible configurations.
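
A small regular-expression example shows both the strength and the weakness listed above. It is a sketch under assumptions: the target (a blood pressure reading written as "BP 120/80" or "BP: 120/80") and the names BP_PATTERN and extract_bp are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical semi-structured target: "BP 120/80" or "BP: 120/80".
BP_PATTERN = re.compile(r"\bBP:?\s*(\d{2,3})\s*/\s*(\d{2,3})\b", re.IGNORECASE)

def extract_bp(text: str) -> list[tuple[int, int]]:
    # Return every (systolic, diastolic) pair the pattern finds.
    return [(int(s), int(d)) for s, d in BP_PATTERN.findall(text)]

# Both written variants are covered by the pattern.
found = extract_bp("BP: 120/80 on admission, later bp 135 / 85.")

# A configuration the pattern did not anticipate is silently missed.
missed = extract_bp("Blood pressure was 120 over 80.")
```

Here found holds two readings and missed is empty: any phrasing the pattern author did not think of needs another rule.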
4
Q

[2nd method - Symbolic/grammatical NLP (classical approach)]
Pros?
Cons?

A

+ Great for extracting and mapping large numbers of concepts.

- Complex: more steps mean more opportunities for error.
- Can be slow.
5
Q

[3rd method - Machine learning]
Pros?
Cons?

A

+ Targeted approach = high accuracy.
+ Capable of learning from examples.

- Involves manual training effort (for example, labeling examples by hand).
- New target = new training effort.
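
To make "learning from examples" concrete, here is a minimal sketch of a text classifier. The technique (multinomial Naive Bayes with add-one smoothing) is swapped in for illustration and is not named on the card; the labels "symptom" and "medication" and the four training sentences are made up.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    # examples: (text, label) pairs that someone labeled by hand.
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(model, text):
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior of the label
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical hand-labeled training data.
examples = [
    ("patient reports severe cough", "symptom"),
    ("persistent cough and fever", "symptom"),
    ("prescribed ibuprofen twice daily", "medication"),
    ("continue ibuprofen as needed", "medication"),
]
model = train(examples)
pred1 = predict(model, "fever and cough")
pred2 = predict(model, "ibuprofen daily")
```

The model only knows the two labels it was trained on. Recognizing a new target would require collecting and labeling a new set of examples.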
6
Q

Calculate the TF-IDF (term frequency - inverse document frequency) value of the following.

The term t appears 20 times in a document that contains 100 words.
The collection contains 10,000 documents, and the term t appears in 100 of them.

Interpret the value.

A

TF = 20/100 = 0.2
IDF = log_10(10000/100) = 2

TF-IDF = TF * IDF = 0.2 * 2 = 0.4.

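The higher the TF, the more relevant the document is for the term.
The lower the IDF, the less discriminative the term is.
The higher the TF-IDF score, the more important or relevant the term is.
Here t makes up 20% of the document but occurs in only 1% of the collection, so it is a fairly distinctive term for this document.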
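
The calculation above can be checked with a short script. It is a sketch that assumes the same definitions as the card (TF as a relative frequency, IDF as a base-10 logarithm); the function name tf_idf and its parameter names are hypothetical, and other TF-IDF variants use different weightings.

```python
import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    # TF: share of the document's words that are the term.
    tf = term_count / doc_length
    # IDF: base-10 log of (collection size / documents containing the term).
    idf = math.log10(num_docs / docs_with_term)
    return tf, idf, tf * idf

# Values from the card: 20 of 100 words, 100 of 10,000 documents.
tf, idf, score = tf_idf(20, 100, 10_000, 100)
```

With these inputs the function reproduces TF = 0.2, IDF = 2 and TF-IDF = 0.4. If the term appeared in all 10,000 documents, the IDF and therefore the score would drop to 0.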
