Natural Language Processing Flashcards

Question 1

Q

What does NLP stand for, and what does it mean?

Give some examples of natural language.

Answer

A

NLP stands for Natural Language Processing. It is the field concerned with programs that analyse and process natural language data.

Examples of natural language: Speech, Text.

Question 2

Q

Give some examples of easy and hard NLP tasks.

Answer

A

Easy:
•	Spell Checking
•	Finding Synonyms
•	Keyword Search
•	Parsing Information
Hard:
•	Translation
•	Speech Recognition
•	Coreference Resolution (who does ‘she’ refer to?)
•	Question Answering (especially visual question answering)

Question 3

Q

Why is NLP hard? Give an example sentence.

Answer

A

NLP is hard because natural language lacks a defined structure and requires a strong understanding of context (e.g. sarcasm, idioms, emotion).
Here is a pair of example sentences that demonstrates one of the issues:
The TV didn’t fit through the corridor because it was too wide.
The TV didn’t fit through the corridor because it was too narrow.
The word that ‘it’ refers to changes when the adjective changes: with ‘wide’ it is the TV, with ‘narrow’ it is the corridor. Working this out requires knowledge of the physical situation, which is incredibly difficult to model.

Question 5

Q

Explain what a term, a document, and a corpus are. Give examples.

Answer

A

A corpus is a collection of documents, and each document is made up of terms. These are general concepts that vary greatly in size and scale, for example:
• Corpus: A set of books
  o Document: A book
    ▪ Term: A word in a book

• Corpus: A set of tweets
  o Document: A tweet
    ▪ Term: A word in a tweet

• Corpus: A collection of articles
  o Document: An article
    ▪ Term: A word in an article

Question 6

Q

What is the NLP pipeline?

Answer

A

The pipeline has three stages, in order:
• Text Pre-processing
• Feature Extraction
• Modelling

Question 7

Q

List the methods of text pre-processing.

Answer

A

Case Normalisation
Punctuation Removal
Stop Word Removal
Stemming
Lemmatization
Tokenization

Question 8

Q

Explain Case Normalisation.

Answer

A

Here we convert all text to the same case (usually lowercase), so that ‘The’ and ‘the’ are treated as the same term.
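
As a minimal sketch, case normalisation in Python can be a single call to the built-in str.lower(). The function name normalise_case is made up for this card.

```python
def normalise_case(text: str) -> str:
    """Return the text with every character converted to lowercase."""
    return text.lower()


print(normalise_case("The Cat sat on THE mat"))  # the cat sat on the mat
```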

Question 9

Q

Explain Punctuation Removal.

Answer

A

Here we simply remove all of the punctuation.
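
One possible way to do this in Python uses the standard library only. The function name remove_punctuation is an assumption for illustration, and string.punctuation covers ASCII punctuation only, so curly quotes and similar characters would need to be added separately.

```python
import string


def remove_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character from the text."""
    # The third argument of str.maketrans lists characters to delete.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)


print(remove_punctuation("Hello, world! It's 5 o'clock."))  # Hello world Its 5 oclock
```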

Question 10

Q

Explain Stop Word Removal.

Answer

A

Here we get rid of commonly occurring words that do not add any significant meaning to a sentence.
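
A small sketch of the idea, assuming the text has already been split into tokens. The STOP_WORDS set below is a tiny hypothetical list chosen for this example; real stop word lists are much longer.

```python
# Hypothetical, deliberately short stop word list for illustration.
STOP_WORDS = {"the", "a", "an", "is", "on", "of", "and", "to", "in"}


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


print(remove_stop_words(["The", "cat", "is", "on", "the", "mat"]))  # ['cat', 'mat']
```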

Question 11

Q

Explain Stemming. What is the problem with it?

Answer

A

Here we reduce inflected words to their root form by removing the suffix.

This is quick and simple but will not always yield real words:
• Halves -> Halv
• Caching -> Cach
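
The behaviour above can be illustrated with a toy suffix stripper. This is not a real stemming algorithm such as Porter's, just a sketch under the assumption that we strip the first matching suffix from a short, made-up list (SUFFIXES); the name toy_stem is also hypothetical.

```python
# Illustrative suffix list, checked in this order. Not a real stemmer's rule set.
SUFFIXES = ("ing", "es", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["Halves", "Caching", "jumped", "cats"]:
    print(w, "->", toy_stem(w))
```

Running it shows the same problem as on the card: "halv" and "cach" are not real words.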

Question 12

Q

Explain Lemmatization.

Answer

A

Here we reduce inflected words to their dictionary form (the lemma) by using a lookup table.
This is much slower than stemming, but much more accurate, because the result is always a real word.
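
A minimal sketch of the lookup idea, assuming a tiny hand-written table. LEMMA_TABLE and lemmatize are hypothetical names; a real lemmatizer uses a full dictionary and usually the word's part of speech as well.

```python
# Hypothetical lookup table with a handful of entries for illustration.
LEMMA_TABLE = {
    "halves": "half",
    "caching": "cache",
    "mice": "mouse",
    "went": "go",
}


def lemmatize(word: str) -> str:
    """Return the lemma from the table, or the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


for w in ["Halves", "Caching", "mice", "corridor"]:
    print(w, "->", lemmatize(w))
```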

Question 13

Q

Explain Tokenization.

Answer

A

This is the process of splitting the document into a list of ‘tokens’, symbols that are not split up any further.
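
One simple way to tokenize English text is with a regular expression. The sketch below assumes that a token is a run of ASCII letters or digits, optionally joined by straight apostrophes so that contractions stay whole; other tokenizers make different choices.

```python
import re

# A token is a run of letters/digits, optionally with internal apostrophes.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*")


def tokenize(text: str):
    """Split the text into a list of word tokens, dropping punctuation."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("The TV didn't fit through the corridor."))
```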

Question 14

Q

What is an n-gram?

Answer

A

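An n-gram is a contiguous sequence of n words (or tokens) from a document. For example, the 2-grams (bigrams) of ‘the cat sat’ are ‘the cat’ and ‘cat sat’. We can use n-grams as features.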
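
As a sketch, n-grams can be produced by sliding a window of length n over a token list. The function name ngrams is chosen for this example.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["the", "tv", "was", "too", "wide"]
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```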

Question 15

Q

What is Feature Extraction?

Answer

A

Feature extraction is the process of turning a text document into numerical features that a model can use.

Question 16

Q

What is Bag of Words?

Answer

A

Here each document is represented as an unordered collection of words. Each word in the document has a word count, and the order of the words is discarded.

This is a good exploratory analysis tool, or the counts can be used as input features for a supervised ML algorithm.
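
A bag of words can be sketched with collections.Counter from the standard library, assuming the document is already tokenized. The function name bag_of_words is made up for this card.

```python
from collections import Counter


def bag_of_words(tokens):
    """Map each distinct token to the number of times it occurs."""
    return Counter(tokens)


bag = bag_of_words("the cat sat on the mat".split())
print(bag["the"])  # 2
print(bag["dog"])  # 0, because a Counter returns 0 for unseen words
```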

Question 17

Q

Explain TF-IDF.

Answer

A

TF-IDF (Term Frequency – Inverse Document Frequency)

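Words like ‘the’ appear many times, but that doesn’t mean that they are more relevant. TF-IDF addresses this by giving lower weights to words that occur commonly across the whole corpus. The intuition is:
Frequency of the word in the document / Frequency of the word in the corpus
A common concrete form is:
TF-IDF(t, d) = TF(t, d) * log(N / DF(t))
where TF(t, d) is how often term t appears in document d, N is the number of documents in the corpus, and DF(t) is the number of documents that contain t.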
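
The sketch below computes one common textbook variant, TF(t, d) * log(N / DF(t)), where TF is the term's share of the document's tokens, N is the number of documents, and DF is the number of documents containing the term. This variant is an assumption on top of the card's simplified ratio, and libraries often add smoothing. The name tf_idf and the three-document corpus are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """corpus: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
scores = tf_idf(corpus)
print(scores[0])
```

Here ‘the’ appears in every document, so its weight is log(3 / 3) = 0, while ‘mat’ appears in only one document and gets a higher weight than ‘cat’.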

Question 18

Q

Explain cosine similarity.

Answer

A

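Here we measure the similarity of two vectors, for example the word-count or TF-IDF vectors of two documents, using the following equation:
cos(θ) = (A · B) / (|A| * |B|)
A value close to 1 means the vectors point in nearly the same direction. When we use this on the vectors of two documents, we can compare which documents are the most similar.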
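
The calculation can be sketched in plain Python as the dot product of two vectors divided by the product of their lengths. The vectors below are hypothetical word counts over a made-up three-word vocabulary, and returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical word counts over the vocabulary ["cat", "dog", "mat"].
doc_a = [2, 0, 1]
doc_b = [4, 0, 2]  # same proportions as doc_a
doc_c = [0, 3, 0]  # shares no words with doc_a

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

doc_a and doc_b score approximately 1 because they use the same words in the same proportions, while doc_a and doc_c score 0 because they share no words.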

Question 19

Q

What does the Naive Bayes Model do?

Answer

A

This is a model which can perform text classification. This means that it can predict which category a document belongs to.
• Which subreddit a post belongs to
• Whether a film review is positive or negative
This model is quick and gets reasonably good results but can definitely be improved upon.

Question 20

Q

How does the Naive Bayes model work?

Answer

A

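The central idea is that each feature x will occur with a certain probability in a document if the document is in category Y.

This relies on the naïve assumption that the features are independent given the category (which they really aren’t).

The quantity we want is:
P(Y | x1, x2, x3 … xp)

We use Bayes’ rule to calculate this:

P(Y | x1, x2, x3 … xp) = P(x1, x2, x3 … xp | Y) * P(Y) / P(x1, x2, x3 … xp)

Under the independence assumption, the likelihood factorises:

P(x1, x2, x3 … xp | Y) = P(x1 | Y) * P(x2 | Y) * … * P(xp | Y)

These quantities can all be estimated from the dataset. The denominator is the same for every category, so it can be ignored when choosing the most probable category.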
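
A minimal sketch of a word-count Naive Bayes classifier follows. It makes several assumptions that are not on the card: probabilities are summed as logs to avoid underflow, add-one (Laplace) smoothing is used so an unseen word does not give a probability of zero, words never seen in training are skipped, and the denominator P(x1, x2, x3 … xp) is dropped because it is the same for every category. The names train and predict and the four tiny reviews are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels):
    """docs: list of token lists; labels: one category per document."""
    class_counts = Counter(labels)          # used for P(Y)
    word_counts = defaultdict(Counter)      # used for P(x | Y)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the category with the highest log P(Y) + sum of log P(x | Y)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for t in tokens:
            if t not in vocab:
                continue  # assumption: ignore words never seen in training
            # Add-one smoothing (an assumption, not from the card).
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = [
    "great film loved it".split(),
    "wonderful acting great story".split(),
    "terrible film hated it".split(),
    "boring story terrible acting".split(),
]
labels = ["pos", "pos", "neg", "neg"]
model = train(docs, labels)

print(predict(model, "great story".split()))
print(predict(model, "terrible acting".split()))
```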