Natural Language Processing Flashcards by Matthew Gilbert

What is the Goal of Natural Language Processing?

To make machines understand and interpret human language the way it is written or spoken.

What are the two levels of Linguistic Analysis?

Syntax: Whether the given text is grammatically well-formed, and how it is structured

Semantics: What is the meaning of the given text

What is Natural Language Understanding?

Trying to understand the meaning of the given text

What are the four ambiguities that need to be resolved for NLU?

Lexical, Syntactic, Semantic, Anaphoric

What is Lexical Ambiguity?

A word has multiple meanings, as in polysemy (related senses) or homonymy (unrelated senses that share a spelling).

What is Syntactic Ambiguity?

A sentence has multiple parse trees

What is Semantic Ambiguity?

A sentence has multiple possible meanings, even when its words and grammatical structure are clear.

What is Anaphoric Ambiguity?

A referring word or phrase, such as a pronoun, could point back to more than one earlier entity in the text.

What are the four steps in the NLU process?

Syntax Analysis, Semantics, Named Entity Recognition, Intent Recognition.

What are the 7 steps in the NLP Pipeline?

Sentence Segmentation, Tokenization, Stemming, Part of Speech Tagging, Parsing, Named Entity Recognition, Co-Reference (Discourse) Resolution.

What is Sentence Segmentation?

The process of identifying the sentence boundaries in the text.
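
As a rough illustration, sentence boundaries can be approximated with a regular expression. This is only a sketch: the function name `split_sentences` and the splitting rule are assumptions made for this example, and the rule will mis-split abbreviations such as "Dr. Smith".

```python
import re

def split_sentences(text):
    # Split on whitespace that follows ., ! or ? and precedes an uppercase letter.
    # This is a naive heuristic, not a full sentence segmenter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

print(split_sentences("NLP is fun. Is it hard? Sometimes!"))
```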

What is Tokenization?

The process of splitting text into tokens such as words, numbers, and punctuation marks.
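
A minimal sketch of the idea, assuming a simple regular-expression tokenizer. The name `tokenize` and the pattern are illustrative choices; real tokenizers handle many more cases, such as contractions and URLs.

```python
import re

def tokenize(text):
    # Try numbers first (with an optional decimal part), then words,
    # then any single character that is neither a word character nor whitespace.
    return re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", text)

print(tokenize("It costs 3.50 dollars, right?"))
```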

What is Stemming?

The process of stripping affixes from the ends of words to reduce them to a common stem.

What is Part of Speech (POS) Tagging?

The process of assigning each word in a sentence its own part of speech tag such as designating words as nouns or verbs.

What is Parsing?

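The process of analyzing the grammatical structure of a sentence by dividing it into constituents, such as noun phrases and verb phrases, and working out how they relate to each other.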

What is Named Entity Recognition?

The process of identifying entities such as a person, location, or time.

What is Co-Reference (Discourse) Resolution?

The process of determining how a given word in a sentence relates to the previous and next sentences, so that words and phrases referring to the same entity are linked.

What is the goal of Lemmatization and Stemming?

The goal is to reduce the inflectional forms and derivationally related forms of a word to a common base form.

What is the difference between Lemmatization and Stemming?

Stemming is a crude heuristic process that just chops the end of the word off, whereas lemmatization does it properly with the use of a vocabulary and morphological analysis of words.
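
The contrast can be sketched with two toy functions. Everything here is an assumption made for illustration: `naive_stem` uses a made-up suffix list rather than a real stemming algorithm, and `LEMMAS` is a tiny hand-written lookup table standing in for the vocabulary and morphological analysis a real lemmatizer would use.

```python
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical toy vocabulary mapping word forms to their base forms.
LEMMAS = {"jumping": "jump", "running": "run", "ran": "run", "mice": "mouse"}

def naive_stem(word):
    # Chop the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Look the word up in the vocabulary; fall back to the word itself.
    return LEMMAS.get(word, word)

for w in ["jumping", "running", "ran", "mice"]:
    print(w, naive_stem(w), toy_lemmatize(w))
```

The stemmer turns "running" into "runn" and leaves "ran" and "mice" unchanged, while the lookup maps them to "run" and "mouse".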

What are stop words?

A list of the most common words in a language. This list is not universal and can change depending on application.
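
A short sketch of filtering stop words out of a token list. The `STOP_WORDS` set below is a small hypothetical list chosen for the example, not a standard one.

```python
# Hypothetical stop word list; real lists vary by language and application.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def remove_stop_words(tokens):
    # Compare in lowercase so "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "is", "on", "the", "mat"]))
```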

What is a “Bag-of-Words”?

A simple feature extraction technique that describes the occurrence of each word in a document, with no regard for where the words appear. The idea is that similar documents have similar contents.
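
A minimal sketch using Python's `collections.Counter`. The function name `bag_of_words` and the whitespace tokenization are assumptions for this example.

```python
from collections import Counter

def bag_of_words(text):
    # Count how often each lowercase, whitespace-separated word occurs.
    return Counter(text.lower().split())

a = bag_of_words("the cat saw the dog")
b = bag_of_words("the dog saw the cat")
print(a)
print(a == b)  # word order is ignored, so both texts get the same bag
```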

What is Term Frequency-Inverse Document Frequency (TF-IDF)?

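A statistical measure used to evaluate how important a word is to a document within a collection of documents (a corpus). The score rises with how often the word appears in the document and falls with how many documents in the collection contain it.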
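
One common formulation, sketched under the assumption that TF is the word's relative frequency in the document and IDF is the natural log of (number of documents / number of documents containing the word). Libraries often use smoothed variants, so their scores will differ from this sketch. The function name `tf_idf` and the sample corpus are made up for the example.

```python
import math

def tf_idf(term, doc, corpus):
    # doc is a list of tokens; corpus is a list of such documents.
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
print(tf_idf("the", corpus[0], corpus))  # appears in every document
print(tf_idf("sat", corpus[0], corpus))  # appears in only one document
```

A word that appears in every document scores zero, while a word unique to one document scores highest.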

What is N-gram word prediction?

Using the probabilities of a sequence of words to choose the most likely next word or provide correction of spelling errors.
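
A minimal sketch of next-word prediction with bigrams (N = 2). The helper names `train_bigrams` and `predict_next` and the sample text are assumptions for this example.

```python
from collections import Counter, defaultdict

def train_bigrams(tokens):
    # For each word, count which words follow it.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    # Return the most frequent follower, or None if the word was never seen.
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

tokens = "i like tea . i like coffee . i like tea .".split()
model = train_bigrams(tokens)
print(predict_next(model, "like"))
print(predict_next(model, "juice"))
```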

What is the Markov Assumption for Language?

Only prior local context, the last few words, affects the next word. This means that the probability of a word only depends on the previous N-1 words.
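
Written as a formula, with $w_i$ denoting the $i$-th word (the notation is an assumption, not taken from the card), the approximation reads:

```latex
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
```

For a bigram model (N = 2), each word depends only on the one before it, so the probability of a whole sentence is approximated by $\prod_{i} P(w_i \mid w_{i-1})$.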

What are the limitations of the N-gram model?

A higher N generally gives a better model, but it adds a lot of computational overhead. N-grams are also a sparse representation of a language: the model assigns zero probability to any word sequence that does not appear in the training corpus.
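
The zero-probability problem can be seen in a small sketch that estimates bigram probabilities from raw counts. The function name `bigram_prob` and the sample sentence are assumptions for this example.

```python
from collections import Counter

def bigram_prob(tokens, prev, word):
    # Estimate P(word | prev) as count(prev, word) / count(prev, anything).
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_count = sum(c for (p, _), c in pair_counts.items() if p == prev)
    if prev_count == 0:
        return 0.0
    return pair_counts[(prev, word)] / prev_count

tokens = "the cat sat on the mat".split()
print(bigram_prob(tokens, "the", "cat"))  # seen in training
print(bigram_prob(tokens, "the", "dog"))  # never seen in training
```

The pair "the dog" is perfectly plausible English, but because it never occurs in the training text, the model gives it a probability of zero.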