Natural Language Processing Flashcards

1
Q

What is a Natural Language Processor (NLP)?

A

A Natural Language Processor (NLP) is a system whose goal is to make machines understand and interpret human language.

2
Q

What is meant by linguistic analysis?

A

Analysis that should be undertaken before language processing to help the machine understand the formation of words and their relationships.

3
Q

What is the linguistic analysis of syntax?

A

An analysis of whether the text is grammatically correct.

4
Q

What is the linguistic analysis of semantics?

A

An analysis of the meaning of the given text.

5
Q

What is Natural Language Understanding (NLU)?

A

A Natural Language Understanding (NLU) module attempts to understand the meaning of a text. This means that the nature and structure of each individual word in the text must be known.

6
Q

How does a Natural Language Understanding module understand the structure of a sentence?

A

By resolving ambiguities present in the natural language.

7
Q

What is a lexical ambiguity?

A

The multiple meanings of words.

e.g. Polysemy: ‘bank’ can mean a financial bank or a river bank.
Synonymy: ‘big’ and ‘large’ have near-identical meanings.

8
Q

What is a syntactic ambiguity?

A

A sentence having multiple possible parse trees, i.e. multiple grammatical structures.

e.g. ‘I saw the man with the telescope.’ - the telescope may belong to either the seer or the man.

9
Q

What is a semantic ambiguity?

A

The multiple meanings of a sentence, even when the individual words and structure are unambiguous.

e.g. ‘The chicken is ready to eat.’ - either the chicken is about to eat, or the chicken is cooked and ready to be eaten.

10
Q

What is an anaphoric ambiguity?

A

A phrase or word that refers to an entity mentioned previously in the text.

e.g. Lucy went to the cinema. She had fun.
‘She’ refers to ‘Lucy’, the entity mentioned prior. Resolving this reference is necessary to understand the second sentence.

11
Q

What is syntax analysis?

A

The analysis of sentence structure, including parts of speech (POS).

12
Q

What is semantic analysis?

A

The analysis of the meanings of words and phrases.

13
Q

What is Named Entity Recognition (NER)?

A

The identification of named entities (e.g. people, places, organisations) in the input.

14
Q

What is intent recognition?

A

The understanding of the speaker’s intent.

15
Q

What are the four steps of natural language understanding?

A

Syntax analysis: structure
Semantic analysis: meaning
Named Entity Recognition (NER): entity identification
Intent Recognition: understanding

16
Q

What are the primary steps of the NLP pipeline?

A

Segment the text into individual sentences using punctuation.
Tokenise each sentence into its words, numbers and punctuation.
‘Stem’ the words by stripping their endings (e.g. ‘ending’ turns to ‘end’).
Assign each word a ‘tag’ that designates its part of speech, such as noun or adverb.
Divide the text into different categories.
Identify entities, and define the relationship between certain words and the entities of previous sentences.

17
Q

What is sentence segmentation?

A

The method of separating a paragraph into its individual sentences. This is typically done using punctuation marks.
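A minimal sketch in plain Python (regex-based; real segmenters also handle abbreviations such as ‘e.g.’, which this naive split gets wrong):

```python
import re

def segment_sentences(paragraph):
    # Split after sentence-final punctuation (., !, ?) followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", paragraph.strip())

print(segment_sentences("NLP is fun. Is it hard? Sometimes!"))
# → ['NLP is fun.', 'Is it hard?', 'Sometimes!']
```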

18
Q

What is tokenisation?

A

Tokenisation is the technique of separating a sentence into a list of its words, numbers and punctuation.
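A simple regex-based sketch (real tokenisers treat contractions and other special cases more carefully):

```python
import re

def tokenize(sentence):
    # \w+ matches runs of letters/digits; [^\w\s] matches single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenize("The cat sat on the mat."))
# → ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
```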

19
Q

What is stemming?

A

Stemming is the removal of the ‘endings’ of words, or their suffixes, by simply chopping them off.

e.g. ‘ending’ -> ‘end’, or ‘sprinting’ -> ‘sprint’.
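A naive suffix-chopping sketch (a real stemmer, such as the Porter stemmer, applies ordered rewrite rules instead of this single pass):

```python
def stem(word):
    # Chop off the first matching suffix, keeping at least a 3-letter stem.
    for suffix in ("ing", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("ending"), stem("sprinting"))  # → end sprint
```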

20
Q

What is lemmatization, and how is it different from stemming?

A

Lemmatization is the ‘proper’ removal of word endings compared to stemming.

While stemming simply chops the endings off, lemmatization attempts to remove only inflectional endings and return words to their ‘dictionary form’ (lemma).

i.e. The word ‘meeting’ can either be the base form of a noun (a meeting), or a form of a verb (to meet). Stemming would always produce ‘meet’, while lemmatization could distinguish between the two.
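The ‘meeting’ distinction can be sketched with a tiny illustrative lexicon (real lemmatizers, e.g. WordNet-based ones, use full dictionaries keyed by the word’s POS tag):

```python
# Illustrative lexicon: (word, part of speech) -> lemma.
LEMMAS = {
    ("meeting", "noun"): "meeting",  # base form of the noun 'a meeting'
    ("meeting", "verb"): "meet",     # inflected form of the verb 'to meet'
}

def lemmatize(word, pos):
    # Fall back to the word itself when it is not in the lexicon.
    return LEMMAS.get((word, pos), word)

print(lemmatize("meeting", "noun"), lemmatize("meeting", "verb"))  # → meeting meet
```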

21
Q

What are stop words?

A

Stop words are the most common words in a language. They are often filtered out before processing, since they carry little meaning on their own.

e.g. ‘and’, ‘the’, ‘a’, etc.
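Removing stop words is a simple set-membership filter (the stop list here is a small illustrative sample, not a standard list):

```python
STOP_WORDS = {"and", "the", "a", "an", "is", "of", "to"}

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop list (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "is", "sleeping", "on", "the", "couch"]))
# → ['cat', 'sleeping', 'on', 'couch']
```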

22
Q

What is Part-of-Speech (POS) tagging?

A

POS tagging is a supervised learning task that classifies each word as one of the eight ‘parts of speech’, using features such as the previous word, the next word, capitalization, etc.
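The features such a tagger might feed to its classifier can be sketched as a function (the exact feature set here is illustrative, not from any particular tagger):

```python
def extract_features(tokens, i):
    # Context features a supervised POS tagger could pass to a classifier.
    word = tokens[i]
    return {
        "word": word.lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
        "is_capitalized": word[0].isupper(),
        "suffix": word[-2:],  # crude morphological hint, e.g. 'ly' for adverbs
    }

print(extract_features(["Lucy", "went", "home"], 1))
```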

23
Q

What is the Bag of Words model?

A

The bag of words model is a feature extraction technique that simply describes how often a word appears in a document.
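A bag of words is just an occurrence count; in Python a Counter captures it directly (note that word order is discarded entirely):

```python
from collections import Counter

def bag_of_words(tokens):
    # Map each (lowercased) word to how often it appears; order is lost.
    return Counter(t.lower() for t in tokens)

print(bag_of_words(["The", "cat", "sat", "on", "the", "mat"]))
# → Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```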

24
Q

How can we use Bag of Words to analyze a document?

A

We can assume that two documents with a similar meaning will use some of the same words. Comparing the occurrence of words between documents allows us to get a base idea of what the original document is about.

e.g. We could identify a document as a literature review by comparing its contents to other literature reviews.

25
Q

What is the purpose of the TF-IDF rating?

A

TF-IDF is a statistical measure used to evaluate the importance of a term to a document within a collection.

It combines how often the term appears in that document - Term Frequency (TF) - with the inverse of the proportion of documents containing that term - Inverse Document Frequency (IDF).
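A common formulation (several weighting variants exist) can be sketched as:

```python
import math

def tf_idf(term, doc, corpus):
    # TF: relative frequency of the term in this document.
    tf = doc.count(term) / len(doc)
    # IDF: log of (total documents / documents containing the term).
    n_containing = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / n_containing)
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "and", "the", "dog"],
]
print(tf_idf("cat", corpus[0], corpus))  # > 0: 'cat' is somewhat distinctive
print(tf_idf("the", corpus[0], corpus))  # 0.0: 'the' appears in every document
```

A term that appears in every document gets an IDF of log(1) = 0, so TF-IDF automatically discounts stop-word-like terms.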

26
Q

What is word prediction?

A

The use of the probabilities of sequences of words to calculate the likeliest next word. This is typically done through Maximum Likelihood Estimation (MLE) or the Chain Rule.

27
Q

What is Chain Rule?

A

Chain Rule is a method used to calculate the probability of a sequence of words. It expresses that probability as the product of conditional probabilities: the probability that Wn will occur given Wn-1, Wn-2, …, W1.

e.g. P(W1, W2, W3) = P(W1) * P(W2|W1) * P(W3|W1, W2)
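Given the conditional probabilities, the joint probability is just their product; a minimal sketch (the probability values below are made up for illustration):

```python
def chain_rule_probability(cond_probs):
    # cond_probs[i] holds P(W_{i+1} | W_1 ... W_i); multiply them all together.
    p = 1.0
    for cp in cond_probs:
        p *= cp
    return p

# P(W1) = 0.2, P(W2|W1) = 0.5, P(W3|W1,W2) = 0.4
print(chain_rule_probability([0.2, 0.5, 0.4]))  # ≈ 0.04
```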

28
Q

What is the Markov Assumption?

A

The Markov Assumption is an assumption we make when calculating word probability.

We assume that only prior local context affects the next word. We do this to avoid the scaling problem we face when dealing with large documents.

29
Q

What is an N-gram?

A

An N-gram is an application of the Markov Assumption where we assume that the probability of a word only depends on the previous n-1 words.

An N-gram where N is 2 is called a BIGRAM. An N-gram where N is 3 is called a TRIGRAM.

30
Q

What is the benefit to using N-grams?

A

In large documents, sometimes thousands of words long, an N-gram limits the context to a fixed, small window (e.g. 4 words), greatly reducing the computation time.

31
Q

What are some limitations to using N-grams?

A

The higher the N, the better the model tends to be; however, higher N also means higher computational overhead.

Furthermore, N-grams are a sparse representation of language - any sequence not seen in the training corpus is simply given a probability of 0.

32
Q

What is the Maximum Likelihood Estimate (MLE)?

A

MLE is a way of estimating the probability that a word will occur given a prior sequence.

It is equal to the number of times the full sequence (the prior sequence followed by the word) occurs in the corpus, divided by the number of times the prior sequence occurs.
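For a bigram model this is count(prev, word) / count(prev); a sketch over a toy corpus:

```python
from collections import Counter

def bigram_mle(tokens):
    # P(word | prev) = count(prev, word) / count(prev)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    def prob(prev, word):
        return bigrams[(prev, word)] / unigrams[prev]
    return prob

corpus = "the cat sat on the mat the cat ran".split()
p = bigram_mle(corpus)
print(p("the", "cat"))  # 'the cat' occurs 2 of the 3 times 'the' appears, ≈ 0.667
```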