Natural Language Processing Flashcards

1
Q

What does NLP stand for, and what does it mean?

Give some examples of natural language.

A

NLP stands for Natural Language Processing: the field concerned with programs that analyse and process natural language data.

Examples of natural language: Speech, Text.

2
Q

Give some examples of easy and hard NLP tasks.

A
Easy:
• Spell Checking
• Finding Synonyms
• Keyword Search
• Parsing Information
Hard:
• Translation
• Speech Recognition
• Coreference Resolution (who does ‘she’ refer to?)
• Question Answering (especially visual question answering)
3
Q

Why is NLP hard? Give an example sentence.

A

NLP is hard because natural language lacks a defined structure and requires a strong understanding of context (e.g. sarcasm, idioms, emotion).
Here is an example sentence that demonstrates one of the issues:
The TV didn’t fit through the corridor because it was too wide.
The TV didn’t fit through the corridor because it was too narrow.
The referent of ‘it’ changes when the adjective changes: in the first sentence ‘it’ is the TV, in the second it is the corridor. Resolving this requires real-world knowledge of the situation, which is incredibly difficult to model.

4
Q

Explain what a term, a document, and a corpus are. Give examples.

A

These are general terms and vary greatly in size and scale: a corpus is a collection of documents, a document is a single unit of text, and a term is a word (or token) within a document. For example:

• Corpus: A set of books
o Document: A book
o Term: A word in a book

• Corpus: A set of tweets
o Document: A tweet
o Term: A word in a tweet

• Corpus: A collection of articles
o Document: An article
o Term: A word in an article

5
Q

What is the NLP pipeline?

A

1. Text Pre-Processing
2. Feature Extraction
3. Modelling

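A rough sketch of how these three stages might fit together in Python; preprocess and extract_features are hypothetical helper names used only for illustration:

```python
# Illustrative only: preprocess and extract_features are hypothetical helpers.
def preprocess(text):
    # Text pre-processing: here just case normalisation and whitespace splitting.
    return text.lower().split()

def extract_features(tokens, vocabulary):
    # Feature extraction: bag-of-words counts over a fixed vocabulary.
    return [tokens.count(word) for word in vocabulary]

vocabulary = ["film", "great", "boring"]
docs = ["A great film", "A boring film"]
X = [extract_features(preprocess(d), vocabulary) for d in docs]
print(X)  # [[1, 1, 0], [1, 0, 1]] -- these vectors then go into the modelling stage
```
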
6
Q

List the methods of text pre-processing.

A
Case Normalisation
Punctuation Removal
Stop-Word Removal
Stemming
Lemmatization
Tokenization
7
Q

Explain Case Normalisation.

A

Here we convert all text to the same case (typically lowercase).

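A minimal sketch in Python, where the built-in str.lower does the work:

```python
text = "The TV didn't FIT through the Corridor"
print(text.lower())  # "the tv didn't fit through the corridor"
```
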
8
Q

Explain Punctuation Removal.

A

Here we simply remove all of the punctuation.

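A minimal sketch using the standard library's string.punctuation and str.translate:

```python
import string

text = "Hello, world! Isn't NLP fun?"
# Map every punctuation character to None and apply the mapping.
cleaned = text.translate(str.maketrans("", "", string.punctuation))
print(cleaned)  # "Hello world Isnt NLP fun"
```
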
9
Q

Explain Stop-Word Removal.

A

Here we remove commonly occurring words (e.g. ‘the’, ‘a’, ‘is’) that do not add any significant meaning to a sentence.

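A minimal sketch with a small hand-picked stop-word set; libraries such as NLTK ship much longer lists:

```python
# Tiny illustrative stop-word set; real lists contain hundreds of words.
stop_words = {"the", "a", "is", "it", "of", "and", "to"}

tokens = ["the", "tv", "is", "too", "wide"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['tv', 'too', 'wide']
```
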
10
Q

Explain Stemming. What is the problem with it?

A

Here we reduce inflected words to their root form by stripping suffixes.

This is quick and simple but will not always yield real words:
• Halves -> Halv
• Caching -> Cach

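A short sketch using NLTK's Porter stemmer (assumes the nltk package is installed); its output should match the non-word examples above:

```python
from nltk.stem import PorterStemmer  # assumes: pip install nltk

stemmer = PorterStemmer()
for word in ["halves", "caching", "running"]:
    print(word, "->", stemmer.stem(word))
# expected: halves -> halv, caching -> cach, running -> run
```
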
11
Q

Explain Lemmatization

A

Here we reduce inflected words to their root form (the lemma) using a vocabulary lookup rather than crude suffix stripping.
This is slower than stemming, but much more accurate and always yields real words.

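A short sketch using NLTK's WordNet lemmatizer (assumes nltk and its WordNet data are available); note that it returns real dictionary forms:

```python
from nltk.stem import WordNetLemmatizer  # assumes the NLTK 'wordnet' corpus is downloaded

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("halves"))        # expected: half
print(lemmatizer.lemmatize("caching", "v"))  # expected: cache (pos="v" treats it as a verb)
```
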
12
Q

Explain Tokenization

A

This is the process of splitting a document into a list of ‘tokens’: units such as words or punctuation marks that are not split up any further.

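A minimal sketch using whitespace splitting; dedicated tokenizers (e.g. nltk.word_tokenize) also separate punctuation and handle contractions:

```python
text = "The TV didn't fit through the corridor."
tokens = text.split()  # naive whitespace tokenization
print(tokens)  # ['The', 'TV', "didn't", 'fit', 'through', 'the', 'corridor.']
```
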
13
Q

What is an n-gram?

A

An n-gram is a contiguous sequence of n words (tokens). We can use n-grams as features.

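A minimal sketch that slides a window of length n over a token list:

```python
def ngrams(tokens, n):
    # Every contiguous run of n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "tv", "was", "too", "wide"]
print(ngrams(tokens, 2))
# [('the', 'tv'), ('tv', 'was'), ('was', 'too'), ('too', 'wide')]
```
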
14
Q

What is Feature Extraction?

A

Feature extraction is the process of generating numerical features (e.g. word counts or TF-IDF weights) from a text document so that it can be used by a model.

15
Q

What is Bag of Words?

A

Here each document is represented as an unordered collection of words: word order is discarded and each word is represented by its count in the document.

This is a good exploratory analysis tool, and the resulting counts can be fed into a supervised ML algorithm.

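A minimal sketch using Python's collections.Counter for a single document (scikit-learn's CountVectorizer builds the same counts for a whole corpus):

```python
from collections import Counter

doc = "the cat sat on the mat"
bag_of_words = Counter(doc.split())
print(bag_of_words)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```
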
16
Q

Explain TF-IDF.

A

TF-IDF (Term Frequency – Inverse Document Frequency)

Words like ‘the’ appear many times, but that does not mean they are more relevant. TF-IDF solves this by giving lower weights to words that occur in many documents. The weight of term t in document d is:

tf-idf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is the frequency of t in d, N is the number of documents in the corpus, and df(t) is the number of documents that contain t.

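A minimal sketch computing the weight by hand with the formula above (scikit-learn's TfidfVectorizer uses a smoothed variant of the same idea):

```python
import math

docs = [["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "dog", "barked"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                    # frequency of the term in this document
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    return tf * math.log(len(docs) / df)

print(tf_idf("the", docs[0], docs))  # 0.0  -- 'the' appears in every document
print(tf_idf("cat", docs[0], docs))  # ~1.1 -- 'cat' appears in only one document
```
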
17
Q

Explain cosine similarity

A

Here we compare the similarity of two vectors (for example, the bag-of-words or TF-IDF vectors of two documents) using:

cos(A, B) = (A · B) / (||A|| ||B||)

where A · B is the dot product and ||A|| is the length (norm) of the vector. A value close to 1 means the vectors point in nearly the same direction, so applying this to document vectors tells us which documents are the most similar.

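A minimal sketch of the formula in plain Python (in practice it is applied to bag-of-words or TF-IDF vectors):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# e.g. word-count vectors for two documents over the same vocabulary
doc1 = [1, 2, 0, 1]
doc2 = [1, 1, 0, 1]
print(cosine_similarity(doc1, doc2))  # ~0.943
```
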
18
Q

What does the Naive Bayes Model do?

A

This is a model that performs text classification: it predicts which category a document belongs to, for example:
• Which subreddit a post belongs to
• Whether a film review is positive or negative
This model is quick and gets reasonably good results but can definitely be improved upon.

19
Q

How does the Naive Bayes model work?

A

The central idea is that each feature x occurs with a certain probability in a document, given that the document is in category Y.

This relies on the naive assumption that the features are independent of each other given the category (which in reality they are not).

We want to estimate:

P(Y | x1, x2, x3 … xp)

Using Bayes’ rule:

P(Y | x1, x2, x3 … xp) = P(x1, x2, x3 … xp | Y) * P(Y) / P(x1, x2, x3 … xp)

The independence assumption lets the likelihood factorise into per-feature terms:

P(x1, x2, x3 … xp | Y) = P(x1 | Y) * P(x2 | Y) * … * P(xp | Y)

P(Y) and each P(xi | Y) can all be estimated from the dataset.
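
A short sketch using scikit-learn's CountVectorizer and MultinomialNB (assumes scikit-learn is installed); the bag-of-words counts play the role of the features x1 … xp, and the example documents and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["great film, really loved it",
        "boring plot and terrible acting",
        "loved the acting, great story",
        "terrible film, really boring"]
labels = ["positive", "negative", "positive", "negative"]

vectoriser = CountVectorizer()   # bag-of-words features
X = vectoriser.fit_transform(docs)

model = MultinomialNB()          # estimates P(Y) and each P(xi | Y) from the counts
model.fit(X, labels)

print(model.predict(vectoriser.transform(["really great story"])))  # ['positive']
```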