Natural Language Processing Flashcards
What is Natural Language Processing (NLP)?
NLP is a field of computing whose goal is to make machines understand and interpret human language.
What is meant by linguistic analysis?
Analysis that should be undertaken before language processing to help the machine understand the formation of words and their relationships.
What is the linguistic analysis of syntax?
An analysis of the grammatical structure of the text, including which parts of it are grammatically correct.
What is the linguistic analysis of semantics?
An analysis of the meaning of the given text.
What is Natural Language Understanding (NLU)?
A Natural Language Understanding (NLU) module attempts to understand the meaning of a text. This means that the nature and structure of each individual word in the text must be known.
How does a Natural Language Understanding module understand the structure of a sentence?
By resolving ambiguities present in the natural language.
What is a lexical ambiguity?
The multiple meanings of words.
e.g. Polysemy: ‘bank’ can mean a financial bank or a river bank.
Synonymy: ‘big’ and ‘large’ have identical meanings.
What is a syntactic ambiguity?
The existence of multiple valid parse trees (grammatical structures) for a single sentence.
What is a semantic ambiguity?
The multiple meanings of a sentence.
e.g. ‘Every student read a book.’ Either all the students read the same book, or each student read a possibly different book.
What is an anaphoric ambiguity?
Uncertainty about which previously mentioned entity a word or phrase (such as a pronoun) refers to.
e.g. Lucy went to the cinema. She had fun.
‘She’ refers to ‘Lucy’, the entity mentioned prior. This is important information to understand the second sentence.
What is syntax analysis?
The analysis of sentence structure, including parts of speech (POS).
What is semantic analysis?
The analysis of the meanings of words and phrases.
What is Named Entity Recognition (NER)?
The identification of named entities in the input.
What is intent recognition?
The understanding of the speaker’s intent.
What are the four steps of natural language understanding?
Syntax analysis: structure
Semantic analysis: meaning
Named Entity Recognition (NER): entity identification
Intent Recognition: understanding
What are the primary steps of the NLP pipeline?
Segment individual sentences using punctuation.
Identify different words, numbers and punctuation by tokenizing.
‘Stem’ the words by stripping the ending (e.g. ‘ending’ turns to ‘end’).
Assign each word a ‘tag’ that designates its part of speech, such as noun or adverb.
Divide the text into different categories.
Identify entities, and define the relationship between certain words and the entities of previous sentences.
What is sentence segmentation?
The method of separating a paragraph into its individual sentences. This is typically done using punctuation marks.
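As a rough illustration of punctuation-based segmentation, the sketch below splits a paragraph after a full stop, exclamation mark or question mark. The function name split_sentences is hypothetical, and the rule is deliberately simple: abbreviations such as ‘Dr.’ would be split incorrectly.

```python
import re


def split_sentences(paragraph: str) -> list[str]:
    """Split a paragraph into sentences using end-of-sentence punctuation."""
    # Split on whitespace that directly follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    # Drop empty strings (e.g. when the input itself is empty).
    return [part for part in parts if part]


sentences = split_sentences("Lucy went to the cinema. She had fun! Did you go?")
print(sentences)
# ['Lucy went to the cinema.', 'She had fun!', 'Did you go?']
```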
What is tokenisation?
Tokenisation is the technique of separating a sentence into a list of its words, numbers and punctuation marks.
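A minimal sketch of tokenisation using a regular expression. The function name tokenize and the exact pattern are assumptions chosen for illustration; real tokenisers handle many more cases (contractions, hyphens, URLs).

```python
import re


def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word, number and punctuation tokens."""
    # Alternatives are tried left to right:
    #   \d+(?:\.\d+)?  a number, optionally with a decimal part
    #   \w+            a run of word characters
    #   [^\w\s]        any single punctuation character
    return re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", sentence)


tokens = tokenize("Lucy paid 7.50 for a ticket, then left.")
print(tokens)
# ['Lucy', 'paid', '7.50', 'for', 'a', 'ticket', ',', 'then', 'left', '.']
```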
What is stemming?
Stemming is the removal of the ‘endings’ of words (their suffixes) by simply chopping them off.
e.g. ‘ending’ -> ‘end’, or ‘sprinting’ -> ‘sprint’.
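The ‘chopping’ idea can be sketched with a short suffix list. The suffixes and the minimum stem length below are assumptions for illustration only; established stemmers such as the Porter stemmer apply a much larger set of rules.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "ly", "s")


def naive_stem(word: str) -> str:
    """Chop the first matching suffix off a word."""
    for suffix in SUFFIXES:
        # Only strip if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(naive_stem("ending"))     # end
print(naive_stem("sprinting"))  # sprint
print(naive_stem("running"))    # runn
```

The last result shows why stemming is considered crude: the output need not be a real word.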
What is lemmatization, and how is it different from stemming?
Lemmatization is a more principled way of removing word endings than stemming.
While stemming simply chops the ends off words, lemmatization attempts to remove only inflectional endings and return words to their ‘dictionary form’.
e.g. The word ‘meeting’ can either be the base form of a noun (a meeting), or a form of a verb (to meet). Stemming would always produce ‘meet’, while lemmatization could distinguish between the two.
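To make the contrast concrete, here is a toy lemmatizer that looks words up by their part of speech. The LEMMAS table and the lemmatize function are hypothetical; a real lemmatizer relies on a full dictionary and morphological rules.

```python
# Hypothetical miniature dictionary keyed by (word, part of speech).
LEMMAS = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("mice", "NOUN"): "mouse",
}


def lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form of a word given its part of speech."""
    key = (word.lower(), pos)
    # Fall back to the lower-cased word when no entry is known.
    return LEMMAS.get(key, word.lower())


print(lemmatize("meeting", "NOUN"))  # meeting
print(lemmatize("meeting", "VERB"))  # meet
```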
What are stop words?
Stop words are the most common words in a language. They carry little meaning on their own and are often removed before further processing.
e.g. ‘and’, ‘the’, ‘a’, etc.
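Stop words are commonly filtered out of a token list before further analysis. The sketch below assumes a small hand-written stop word set; the names STOP_WORDS and remove_stop_words are illustrative, and real lists are far longer.

```python
# Hypothetical, very small stop word list.
STOP_WORDS = {"and", "the", "a", "an", "of", "to", "is", "on"}


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


filtered = remove_stop_words(["The", "cat", "is", "sleeping", "on", "the", "couch"])
print(filtered)
# ['cat', 'sleeping', 'couch']
```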
What is Part-of-Speech (POS) tagging?
POS tagging is a supervised learning task that uses features such as the previous word, the next word and capitalization to label each word as one of the eight ‘parts of speech’.
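The features mentioned above could be extracted along these lines before being passed to a classifier. This sketch covers only the feature extraction step, not the training; the function name pos_features, the feature names and the ‘<START>’/‘<END>’ markers are assumptions.

```python
def pos_features(tokens: list[str], i: int) -> dict[str, object]:
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        # Placeholder markers stand in for missing neighbours.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


features = pos_features(["Lucy", "went", "home"], 0)
print(features)
# {'word': 'lucy', 'is_capitalized': True, 'prev_word': '<START>', 'next_word': 'went'}
```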
What is the Bag of Words model?
The bag of words model is a feature extraction technique that simply describes how often each word appears in a document, ignoring the order of the words.
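A bag of words can be sketched as a word-count table. The function name bag_of_words and the simple lower-casing and word-splitting rule are assumptions made for illustration.

```python
import re
from collections import Counter


def bag_of_words(document: str) -> Counter:
    """Count how often each word occurs in a document."""
    # Lower-case the text and keep runs of word characters only.
    words = re.findall(r"\w+", document.lower())
    return Counter(words)


bow = bag_of_words("The cat sat. The cat slept.")
print(bow["the"], bow["cat"], bow["sat"], bow["dog"])
# 2 2 1 0
```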
How can we use Bag of Words to analyze a document?
We can assume that two documents with a similar meaning will use some of the same words. Comparing the occurrence of words between documents allows us to get a rough idea of what a given document is about.
e.g. We could identify a document as a literature review by comparing its contents to other literature reviews.
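One possible way to compare word occurrences between two documents is cosine similarity over their word counts. The flashcard does not name a specific measure, so cosine similarity is an assumption here, as are the function names.

```python
import math
import re
from collections import Counter


def bag_of_words(document: str) -> Counter:
    """Count how often each word occurs in a document."""
    return Counter(re.findall(r"\w+", document.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Return a score from 0.0 (no shared words) to 1.0 (same proportions)."""
    # Dot product over the words both documents share.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = bag_of_words("the cat sat on the mat")
doc_b = bag_of_words("the cat lay on the mat")
doc_c = bag_of_words("stock prices fell sharply")

print(cosine_similarity(doc_a, doc_b))  # high: most words are shared
print(cosine_similarity(doc_a, doc_c))  # 0.0: no words are shared
```

A higher score suggests the documents use similar vocabulary, which is only a rough proxy for similar meaning.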