Natural Language Processing Flashcards
What does NLP stand for, and what does it mean?
Give some examples of natural language.
NLP stands for Natural Language Processing. It is the field concerned with programs that analyse and process natural language data.
Examples of natural language: Speech, Text.
Give some examples of some easy and some hard NLP algorithms.
Easy:
• Spell Checking
• Finding Synonyms
• Keyword Search
• Parsing Information
Hard:
• Translation
• Speech Recognition
• Co-referencing (who does ‘she’ refer to?)
• Question Answering (Especially visual)
Why is NLP hard? Give an example sentence.
NLP is hard because natural language lacks a defined structure and requires a strong understanding of context (e.g. sarcasm, idioms, emotion).
Here is an example sentence that demonstrates one of the issues:
The TV didn’t fit through the corridor because it was too wide.
The TV didn’t fit through the corridor because it was too narrow.
The referent of ‘it’ changes when the adjective changes (‘wide’ refers to the TV, ‘narrow’ to the corridor) because of the mechanics of the situation. This is incredibly difficult to model.
Explain what a term, a document, and a corpus are. Give examples.
These are general terms, and vary greatly in size and scale, for example:
• Corpus: A set of books
  o Document: A book
    ▪ Term: A word in a book
• Corpus: A set of tweets
  o Document: A tweet
    ▪ Term: A word in a tweet
• Corpus: A collection of articles
  o Document: An article
    ▪ Term: A word in an article
What is the NLP pipeline?
• Text Pre-Processing
• Feature Extraction
• Modelling
List the methods of text pre-processing.
• Case Normalisation
• Punctuation Removal
• StopWord Removal
• Stemming
• Lemmatization
• Tokenization
Explain Case Normalisation.
Here we convert all text to the same case (usually lower case), so that ‘The’ and ‘the’ are treated as the same word.
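A minimal sketch of case normalisation (the example sentence is an assumption):

```python
# convert all text to the same case — here lower case
text = "The TV didn't FIT through the corridor"
normalised = text.lower()
print(normalised)  # "the tv didn't fit through the corridor"
```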
Explain Punctuation Removal.
Here we simply remove all of the punctuation.
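A minimal sketch using Python's standard library (the sample sentence is an assumption):

```python
import string

text = "Hello, world! It's me."
# build a translation table that deletes every punctuation character
cleaned = text.translate(str.maketrans("", "", string.punctuation))
print(cleaned)  # "Hello world Its me"
```

Note that stripping the apostrophe turns "It's" into "Its"; real pipelines often tokenize before removing punctuation to avoid this.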
Explain StopWord Removal
Here we get rid of commonly occurring words that do not add any significant meaning to a sentence.
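A minimal sketch of stop-word removal; the tiny stop-word set below is a hand-picked assumption (libraries such as NLTK ship much larger lists):

```python
# tiny hand-picked stop-word set — an assumption for illustration
stop_words = {"the", "a", "an", "in", "on", "is", "it"}

tokens = "the cat sat on the mat".split()
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['cat', 'sat', 'mat']
```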
Explain Stemming. What is the problem with it?
Here we reduce inflected words to their root form by removing the suffix.
This is quick and simple but will not always yield real words:
• Halves -> Halv
• Caching -> Cach
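A crude suffix-stripping sketch that reproduces the behaviour above; real stemmers (e.g. the Porter stemmer) use ordered rule sets, but the same non-word outputs can still occur:

```python
def crude_stem(word):
    # strip the first matching suffix — naive, so non-words like 'halv' result
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(crude_stem("halves"))   # 'halv'
print(crude_stem("caching"))  # 'cach'
```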
Explain Lemmatization
Here we reduce inflected words to their root form by using a lookup table.
This is much slower, but much more accurate.
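A minimal sketch of lookup-based lemmatization; the toy table below is an assumption (a real lemmatizer uses a full dictionary and part-of-speech information):

```python
# toy lookup table — an assumption for illustration
lemma_table = {"halves": "half", "caching": "cache", "ran": "run"}

def lemmatize(word):
    # fall back to the word itself when it is not in the table
    return lemma_table.get(word, word)

print(lemmatize("halves"))  # 'half'
```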
Explain Tokenization
This is the process of splitting the document into a list of ‘tokens’, symbols that are not split up any further.
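A minimal regex-based tokenizer sketch; the pattern (letters and apostrophes) is one simple choice among many:

```python
import re

# keep runs of letters and apostrophes as tokens, dropping everything else
tokens = re.findall(r"[A-Za-z']+", "The TV didn't fit through the corridor.")
print(tokens)  # ['The', 'TV', "didn't", 'fit', 'through', 'the', 'corridor']
```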
What is an n-gram?
An n-gram is a contiguous sequence of n words (or tokens). We can use n-grams as features.
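A minimal sketch of n-gram extraction over a token list:

```python
def ngrams(tokens, n):
    # every window of n consecutive tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["the", "cat", "sat"], 2))  # [('the', 'cat'), ('cat', 'sat')]
```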
What is Feature Extraction?
Feature extraction is the process of generating features from a text document.
What is Bag of Words?
Here each document is represented as an unordered collection of words, where each word is stored with its count in the document.
This is a good exploratory analysis tool, or you can put it into a supervised ML algorithm.
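A minimal bag-of-words sketch using a counter (the sample sentence is an assumption):

```python
from collections import Counter

# word order is discarded; only the counts remain
bow = Counter("the cat sat on the mat".split())
print(bow["the"])  # 2
```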
Explain TF-IDF.
TF-IDF (Term Frequency – Inverse Document Frequency)
Words like ‘the’ appear many times, but that doesn’t mean that they are more relevant. We can solve this issue by using TF-IDF, which gives lower weights to words that occur in many documents. For a term t in a document d, the equation is:
TF-IDF(t, d) = tf(t, d) × log(N / df(t))
where tf(t, d) is the frequency of t in d, N is the number of documents in the corpus, and df(t) is the number of documents containing t.
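A minimal sketch of the standard TF-IDF weighting (term frequency times log inverse document frequency); the toy corpus is an assumption:

```python
import math

# toy corpus of three tokenised documents — an assumption for illustration
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "ran"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)         # term frequency in this document
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    return tf * math.log(len(docs) / df)    # common words get weighted down
```

Here ‘dog’ (appearing in one document) scores higher than ‘the’ (appearing in two) within the same document, which is exactly the behaviour we want.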
Explain cosine similarity
Here we compare the similarity of two documents by representing each as a vector (e.g. of word counts or TF-IDF weights) and measuring the angle between them:
cos(θ) = (A · B) / (‖A‖ ‖B‖)
A value near 1 means the two documents point in the same direction (very similar); a value near 0 means they share almost no terms. This lets us find which documents in a corpus are the most similar.
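A minimal sketch of cosine similarity over plain Python lists (the vectors are assumed toy inputs):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0, 1], [2, 0, 2]))  # ≈ 1.0 (same direction)
```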
What does the Naive Bayes Model do?
This is a model which can perform text classification. This means that it can predict which category a document belongs to.
• Which subreddit a post belongs to
• Whether a film review is positive or negative
This model is quick and gets reasonably good results but can definitely be improved upon.
How does the Naive Bayes model Work?
The central idea is that each feature x will occur with a certain probability in a document if the document is in category Y.
This relies on the naïve assumption that the features are independent (which they really aren’t).
The main equation is:
P(Y | x1, x2, x3 … xp)
We use Bayes’ rule to calculate this:
P(Y | x1, x2, x3 … xp) = P(x1, x2, x3 … xp | Y) * P(Y) / P(x1, x2, x3 … xp)
Under the independence assumption, P(x1, x2, x3 … xp | Y) = P(x1 | Y) * P(x2 | Y) * … * P(xp | Y), so all of these quantities can be estimated by counting in the dataset.
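A minimal multinomial Naive Bayes sketch with add-one smoothing; the toy review data is an assumption, and a real classifier would use far more training examples:

```python
from collections import Counter
import math

# toy training set (made-up film reviews) — an assumption for illustration
train = [
    (["great", "film", "loved"], "pos"),
    (["loved", "the", "acting"], "pos"),
    (["terrible", "boring", "film"], "neg"),
    (["boring", "waste"], "neg"),
]

label_counts = Counter(label for _, label in train)         # counts for P(Y)
word_counts = {label: Counter() for label in label_counts}  # counts for P(x | Y)
vocab = set()
for tokens, label in train:
    word_counts[label].update(tokens)
    vocab.update(tokens)

def predict(tokens):
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        # log P(Y) + sum of log P(x_i | Y), with add-one smoothing
        score = math.log(count / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            score += math.log((word_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict(["boring", "film"]))  # 'neg'
```

Working in log space avoids multiplying many small probabilities together, and the normalising denominator P(x1 … xp) is dropped since it is the same for every category.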