Natural Language Processing Flashcards
What is the Goal of Natural Language Processing?
To make machines understand and interpret human language the way it is written or spoken.
What are the two levels of Linguistic Analysis?
Syntax: Whether the given text is grammatically well formed, and how it is structured.
Semantics: What the given text means.
What is Natural Language Understanding?
The task of determining the meaning of the given text.
What are the four ambiguities that need to be resolved for NLU?
Lexical, syntactic, semantic, and anaphoric.
What is Lexical Ambiguity?
A single word has multiple meanings, also known as polysemy (related senses) or homonymy (unrelated senses).
What is Syntactic Ambiguity?
A sentence has multiple valid parse trees, for example "I saw the man with the telescope."
What is Semantic Ambiguity?
A sentence has multiple meanings even after its words and structure have been identified.
What is Anaphoric Ambiguity?
A referring word or phrase, such as a pronoun, could point back to more than one earlier entity in the text.
What are the four steps in the NLU process?
Syntax analysis, semantic analysis, named entity recognition, and intent recognition.
What are the 7 steps in the NLP Pipeline?
Sentence segmentation, tokenization, stemming, part-of-speech tagging, parsing, named entity recognition, and co-reference (discourse) resolution.
What is Sentence Segmentation?
The process of identifying the sentence boundaries in the text.
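A minimal sketch of sentence segmentation, assuming a naive rule: a boundary is end punctuation followed by whitespace and an uppercase letter. The function name `split_sentences` is hypothetical, and this rule will wrongly split on abbreviations such as "Dr. Smith".

```python
import re

def split_sentences(text):
    # Split on whitespace that follows ., ! or ? and precedes an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

sentences = split_sentences("NLP is fun. Is it hard? It can be!")
print(sentences)
```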
What is Tokenization?
The process of identifying the individual words, numbers, and punctuation marks in the text.
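A rough sketch of tokenization with a single regular expression. The pattern and the name `tokenize` are illustrative assumptions; real tokenizers handle contractions, hyphens, and URLs with many more rules.

```python
import re

def tokenize(text):
    # Try numbers (with optional decimals) first, then words,
    # then any single non-space, non-word character as punctuation.
    return re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", text)

tokens = tokenize("It costs 3.50 dollars, right?")
print(tokens)
```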
What is Stemming?
The process of stripping the endings (suffixes) off words to reduce them to a common stem.
What is Part of Speech (POS) Tagging?
The process of assigning each word in a sentence a part-of-speech tag, such as noun or verb.
What is Parsing?
The process of analyzing the grammatical structure of a sentence, dividing it into constituents such as noun phrases and verb phrases.
What is Named Entity Recognition?
The process of identifying entities such as a person, location, or time.
What is Co-Reference (Discourse) Resolution?
The process of determining how a given word in a sentence relates to words in the preceding and following sentences, for example which earlier entity a pronoun refers to.
What is the goal of Lemmatization and Stemming?
The goal is to reduce the inflectional and derivationally related forms of a word to a common base form.
What is the difference between Lemmatization and Stemming?
Stemming is a crude heuristic process that just chops the end of the word off, whereas lemmatization does it properly with the use of a vocabulary and morphological analysis of words.
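The contrast can be sketched with two toy functions. Both the suffix list and the lemma dictionary below are hypothetical stand-ins: a real stemmer has many more rules, and a real lemmatizer uses a full vocabulary plus morphological analysis.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, deliberately tiny suffix list

def crude_stem(word):
    # Chop the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

LEMMAS = {"studies": "study", "running": "run", "better": "good"}  # toy vocabulary

def toy_lemmatize(word):
    # Look the word up in the vocabulary; fall back to the word itself.
    return LEMMAS.get(word, word)

for w in ("studies", "running", "better"):
    print(w, crude_stem(w), toy_lemmatize(w))
```

The stemmer produces non-words such as "studi" and "runn", while the lookup returns dictionary forms, including the irregular "better" to "good".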
What are stop words?
A list of the most common words in a language. This list is not universal and can change depending on application.
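A small sketch of stop-word filtering. The `STOP_WORDS` set is an illustrative assumption, not a standard list; as noted above, the right list depends on the application.

```python
STOP_WORDS = {"the", "a", "an", "is", "of", "on", "and"}  # illustrative only

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare case-insensitively but keep the original token text.
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words(["The", "cat", "is", "on", "the", "mat"]))
```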
What is a “Bag-of-Words”?
A simple feature extraction technique that describes how often each word occurs in a document, ignoring where the words appear. The idea is that similar documents have similar contents.
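A minimal sketch of a bag-of-words representation using word counts, assuming whitespace tokenization and lowercasing (both simplifications). It shows that word order is discarded: two sentences with the same words in a different order get the same bag.

```python
from collections import Counter

def bag_of_words(text):
    # Count each lowercased, whitespace-separated word; positions are discarded.
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")
print(a)
print(a == b)
```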
What is Term Frequency-Inverse Document Frequency (TF-IDF)?
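A statistical measure of how important a word is to a document within a collection of documents. The score rises with the word's frequency in the document and falls as the word appears in more documents of the collection.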
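A sketch of one common TF-IDF variant, assuming term frequency is the word's share of the document and inverse document frequency is the natural log of (number of documents / documents containing the word). Other weightings exist; the function name and the toy corpus are illustrative.

```python
import math

def tf_idf(term, doc, corpus):
    # tf: fraction of the document's tokens that are `term`.
    tf = doc.count(term) / len(doc)
    # df: number of documents in the corpus that contain `term`.
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
for term in ("the", "cat", "sat"):
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

A word that appears in every document ("the") scores zero, while a word unique to one document ("sat") scores highest.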
What is N-gram word prediction?
Using the probabilities of word sequences to choose the most likely next word or to correct spelling errors.
What is the Markov Assumption for Language?
Only prior local context, the last few words, affects the next word. This means that the probability of a word only depends on the previous N-1 words.
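With N = 2 (a bigram model), the Markov assumption means the next word depends only on the single previous word. A minimal sketch of next-word prediction under that assumption, using raw counts from a toy corpus (the function names and corpus are hypothetical):

```python
from collections import Counter, defaultdict

def train_bigrams(tokens):
    # counts[prev][nxt] = how many times `nxt` followed `prev`.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    # Return the most frequent follower of `prev`, or None if `prev` was never seen.
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

tokens = "i like tea i like coffee i drink tea".split()
model = train_bigrams(tokens)
print(predict_next(model, "i"))
```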
What are the limitations of the N-gram model?
A higher N generally gives a better model, but it also adds a lot of computational overhead.
N-grams are a sparse representation of a language.
The model assigns zero probability to any word or word sequence that does not appear in the training corpus.
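The zero-probability limitation can be seen in a small sketch that estimates bigram probabilities directly from counts. The function name and toy corpus are assumptions for illustration.

```python
from collections import Counter

def bigram_probability(tokens, prev, word):
    # Estimate P(word | prev) as count(prev, word) / count(prev as a context).
    bigrams = Counter(zip(tokens, tokens[1:]))
    prev_count = Counter(tokens[:-1])[prev]
    if prev_count == 0:
        return 0.0
    return bigrams[(prev, word)] / prev_count

tokens = "the cat sat on the mat".split()
print(bigram_probability(tokens, "the", "cat"))
print(bigram_probability(tokens, "the", "dog"))
```

"the dog" is a perfectly plausible phrase, but because it never occurs in the training text its estimated probability is exactly zero.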