Natural Language Processing Flashcards
What does NLP stand for, and what does it mean?
Give some examples of natural language.
NLP stands for Natural Language Processing. It is the field concerned with programs that analyse and process natural language data.
Examples of natural language: Speech, Text.
Give some examples of easy and hard NLP tasks.
Easy:
• Spell Checking
• Finding Synonyms
• Keyword Search
• Parsing Information
Hard:
• Translation
• Speech Recognition
• Co-referencing (who does ‘she’ refer to?)
• Question Answering (especially visual)
Why is NLP hard? Give an example sentence.
NLP is hard because natural language lacks a defined structure and requires a strong understanding of context (e.g. sarcasm, idioms, emotion).
Here is a pair of example sentences that demonstrates one of the issues:
The TV didn’t fit through the corridor because it was too wide.
The TV didn’t fit through the corridor because it was too narrow.
The word ‘it’ refers to the TV in the first sentence and to the corridor in the second. The referent changes when the adjective changes because of the physical mechanics of the situation. This is incredibly difficult to model.
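Explain what a term, a document, and a corpus are. Give examples.
A corpus is a collection of documents, a document is a single unit of text, and a term is a single word within a document. These are general terms, and vary greatly in size and scale, for example: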
• Corpus: A set of books
  o Document: A book
    ▪ Term: A word in a book
• Corpus: A set of tweets
  o Document: A tweet
    ▪ Term: A word in a tweet
• Corpus: A collection of articles
  o Document: An article
    ▪ Term: A word in an article
What is the NLP pipeline?
Text Pre-processing
Feature Extraction
Modelling
List the methods of text pre-processing.
• Case Normalisation
• Punctuation Removal
• Stop Word Removal
• Stemming
• Lemmatization
• Tokenization
Explain Case Normalisation.
Here we convert all text to the same case (usually lower case), so that, for example, ‘Apple’ and ‘apple’ are treated as the same term.
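A minimal sketch of case normalisation in Python. The function name `normalise_case` is hypothetical, and lower-casing is only one possible choice of target case.

```python
def normalise_case(text: str) -> str:
    """Return the text in a single case so that 'Apple' and 'apple' match."""
    # str.casefold() behaves like lower() but is intended for caseless matching
    return text.casefold()
```

For instance, `normalise_case("The Cat SAT")` gives `"the cat sat"`.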
Explain Punctuation Removal.
Here we simply remove all of the punctuation.
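One way this could look in Python, as a sketch only. The helper name `remove_punctuation` is hypothetical, and `string.punctuation` covers ASCII punctuation only, so typographic quotes such as ‘ ’ would need to be added separately.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    """Delete ASCII punctuation characters and leave everything else as is."""
    return text.translate(_PUNCTUATION_TABLE)
```

For instance, `remove_punctuation("Hello, world!")` gives `"Hello world"`.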
Explain Stop Word Removal.
Here we get rid of commonly occurring words (such as ‘the’, ‘a’ and ‘is’) that do not add any significant meaning to a sentence.
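A small sketch of stop word removal. The stop word list below is a tiny illustrative assumption; real lists are much longer and depend on the language and the task.

```python
# Tiny illustrative stop word list (an assumption, not a standard list)
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    # Compare in lower case so that 'The' is treated like 'the'
    return [token for token in tokens if token.lower() not in STOP_WORDS]
```

For instance, `remove_stop_words(["The", "cat", "is", "in", "the", "garden"])` gives `["cat", "garden"]`.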
Explain Stemming. What is the problem with it?
Here we reduce inflected words to a root form (the stem) by removing the suffix.
This is quick and simple but will not always yield real words:
• Halves -> Halv
• Caching -> Cach
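The behaviour above can be reproduced with a deliberately naive suffix stripper. This is a sketch for illustration, not an established stemming algorithm; the suffix list and the minimum stem length of three characters are assumptions.

```python
# Suffixes are tried in this order; the list is an illustrative assumption
SUFFIXES = ("ing", "es", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, with no check that the result is a word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words such as 'is' survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

Here `naive_stem("Halves")` gives `"halv"` and `naive_stem("Caching")` gives `"cach"`, neither of which is a real word.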
Explain Lemmatization.
Here we reduce inflected words to their dictionary form (the lemma) by using a lookup table.
This is much slower than stemming, but much more accurate, because the result is a real word.
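A sketch of the lookup-table idea. The table below is a tiny hand-written assumption; a real lemmatizer relies on a full dictionary and often on the word's part of speech.

```python
# Hand-written illustrative table mapping inflected forms to lemmas
LEMMA_TABLE = {
    "halves": "half",
    "caching": "cache",
    "mice": "mouse",
    "was": "be",
}


def lemmatize(word: str) -> str:
    """Look the word up in the table; unknown words are returned in lower case."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)
```

Here `lemmatize("Halves")` gives `"half"` and `lemmatize("Caching")` gives `"cache"`, both real words.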
Explain Tokenization.
This is the process of splitting the document into a list of ‘tokens’: units (typically words) that are not split up any further.
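A simple regular-expression tokenizer, as a sketch. The pattern is an assumption: it treats runs of word characters as tokens and keeps an ASCII apostrophe inside a word, which is only one of many possible tokenization rules.

```python
import re

# A word, optionally followed by apostrophe-joined parts such as "didn't"
_TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*")


def tokenize(text: str) -> list:
    """Split text into word tokens, discarding whitespace and punctuation."""
    return _TOKEN_PATTERN.findall(text)
```

For instance, `tokenize("The TV didn't fit.")` gives `["The", "TV", "didn't", "fit"]`.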
What is an n-gram?
An n-gram is a contiguous sequence of n words. For example, the 2-grams (bigrams) of ‘the cat sat’ are ‘the cat’ and ‘cat sat’. We can use n-grams as features.
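A short sketch of generating n-grams from a list of tokens. The function name `ngrams` is hypothetical.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple, in order."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # When n exceeds the number of tokens the range is empty, giving []
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For instance, `ngrams(["the", "cat", "sat"], 2)` gives `[("the", "cat"), ("cat", "sat")]`.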
What is Feature Extraction?
Feature extraction is the process of generating features from a text document, so that the document can be used as input to a model.
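As one possible illustration, a document can be turned into a vector of term counts (often called a bag of words). This particular technique is an assumption chosen for the example; the card itself does not name a method, and the function name and vocabulary are hypothetical.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary term occurs in the token list."""
    counts = Counter(tokens)
    # One number per vocabulary term, in vocabulary order; absent terms count 0
    return [counts[term] for term in vocabulary]
```

For instance, with the vocabulary `["cat", "dog", "sat"]`, the tokens `["cat", "sat", "cat"]` give `[2, 0, 1]`.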