Chapter 2 Flashcards
Chapter 2 of Manning NLP
Stemming
Process of stripping word endings, often with simple regular expressions, so that different forms of a word with similar meaning collapse into a common stem.
N-grams
Sequences of n words (pairs are 2-grams, triples are 3-grams, etc.) that occur in order in a sentence. Counting them helps retain some of the meaning carried by word order, unlike a plain BOW.
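A minimal sketch of building 2-grams from a token list with Python's zip; the sentence is just an illustration:

```python
tokens = "the quick brown fox jumps".split()

# Pair each token with its successor to form 2-grams (bigrams).
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]
```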
Challenges with stemming
Hard to handle all the different inflections of a word (e.g., “running”), and to discriminate between a pluralizing “s” at the end of a word like “words” and an “s” that is simply part of the word, as in “bus”.
Semantic stems
Useful cluster of words like lemmas or synonyms
Tokenization
A kind of document segmentation: breaking text up into smaller chunks (segments) with more focused information content, in this case into tokens instead of paragraphs, sentences, or phrases.
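A minimal sketch of a simple tokenizer that splits on whitespace and punctuation using Python's standard re module; the sentence and pattern are illustrative:

```python
import re

sentence = "Thomas Jefferson began building Monticello at the age of 26."

# Split on runs of whitespace or common punctuation to get word tokens.
tokens = [tok for tok in re.split(r"[-\s.,;!?]+", sentence) if tok]
print(tokens)
# ['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26']
```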
Scanner or lexer
Tokenizer used for compiling computer languages
Lexicon
Vocabulary for a computer language
Terminal
The leaf tokens at the ends of the production rules of a context-free grammar (CFG), i.e., the symbols that cannot be expanded any further.
One-hot vectors
Numerical vector representation of each word in a sentence: one row per word, with a 1 in the column for that word's vocabulary entry and 0 everywhere else. They are typically super-sparse (each row contains only a single nonzero value). Like a player-piano paper roll: the vocabulary key tells you which note (word) to play for each row in the sequence of words or piano music.
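A minimal sketch of one-hot encoding a short sentence with NumPy; the sentence and vocabulary ordering are illustrative:

```python
import numpy as np

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))          # ['cat', 'mat', 'on', 'sat', 'the']

# One row per token, one column per vocabulary word; a single 1 marks the word.
onehot = np.zeros((len(tokens), len(vocab)), dtype=int)
for row, tok in enumerate(tokens):
    onehot[row, vocab.index(tok)] = 1
print(onehot)
```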
Disadvantage with one-hot vectors
Creates a space explosion for long documents: one row per token and one column per vocabulary word, almost all of it zeros.
Bag of words
Relies on the idea of gleaning the meaning of a sentence from which words it contains rather than from their order or grammar (i.e., relies on word frequency). Compresses the information content of each document into a data structure that is easier to work with, and can be indexed to indicate which words were used in which document. Note: keep the ordering of the vocabulary consistent so that vectors from different documents line up.
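A minimal sketch of a bag-of-words count vector over a fixed vocabulary, using only the standard library; the sentence is illustrative:

```python
tokens = "the faster Harry got to the store the faster Harry would get home".split()

# Keep the vocabulary ordering fixed so vectors from different documents line up.
vocab = sorted(set(tokens))
bow = [tokens.count(word) for word in vocab]   # counts in vocabulary order
print(list(zip(vocab, bow)))
```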
Bag of Words with dictionary
Storing the bag of words as a Python dict (or Pandas Series) saves space, because you store only the tokens that are present rather than a tuple of ones and zeros covering the whole vocabulary.
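A minimal sketch of the dictionary form: only the tokens that actually occur are stored, so the zeros for the rest of the vocabulary take no space (collections.Counter is one convenient way to build it):

```python
from collections import Counter

tokens = "the faster Harry got to the store the faster Harry would get home".split()

# Sparse bag of words: keys are the tokens that occur, values are their counts.
bow = Counter(tokens)
print(bow.most_common(3))   # [('the', 3), ('faster', 2), ('Harry', 2)]
```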
Dot Product
A way to check for similarity between two sentences by counting the number of overlapping tokens. It is the inner product of two vectors (analogous to an inner join on two tables): a row vector times a column vector, A.T * B, which yields a single scalar.
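A minimal sketch of measuring token overlap between two bag-of-words vectors with NumPy; the vocabulary and vectors are illustrative:

```python
import numpy as np

vocab = ["harry", "faster", "home", "store", "hairy"]

# Binary bag-of-words vectors built over the same vocabulary ordering.
sent_a = np.array([1, 1, 1, 1, 0])
sent_b = np.array([1, 0, 1, 0, 1])

# The dot product counts how many vocabulary words the two sentences share.
print(sent_a.dot(sent_b))   # 2 ('harry' and 'home' overlap)
```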
Cross Product
Produces a vector as its output (unlike the dot product, which produces a scalar).
How Regex works
[] - character class: the match can be any one of the characters inside the square brackets
+ - the match must contain one or more of the preceding character or character class
\s - shortcut for a predefined character class (whitespace); see the sketch below
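A minimal sketch of these pieces in action with Python's re module; the sentence is illustrative:

```python
import re

sentence = "Monticello wasn't designated as UNESCO World Heritage Site until 1987."

# \s+ : split on one or more whitespace characters
print(re.split(r"\s+", sentence))

# [a-zA-Z0-9]+ : find runs of one or more characters from the bracketed class
print(re.findall(r"[a-zA-Z0-9]+", sentence))
```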
spaCy, Stanford CoreNLP, NLTK
Other libraries that implement tokenizers
Contractions
Important to split "wasn't" into "was" and "n't" so that grammar-based NLP models that use syntax trees can separate "was" and "not" and recognize the negation.
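A minimal sketch using NLTK's Treebank tokenizer, one tokenizer that splits contractions this way (assumes nltk is installed):

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("It wasn't designated a World Heritage Site until 1987."))
# The contraction comes out as separate tokens: ..., 'was', "n't", ...
```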
Problems with n-grams
Most n-grams are pretty rare and don't carry correlations with other words that you could use to identify topics or themes. They are usually filtered out if they occur too infrequently or too often.
Stop words
Common words that occur with high frequency in any language but carry little information content; they are excluded from most NLP tasks. Caution: they can still carry information. On the other hand, retaining them may increase the length of the n-grams you need to use. Including stop words does allow document filters to accurately identify and ignore the words and n-grams with the least information content.
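A minimal sketch of stop-word filtering with a hand-rolled list; the stop-word set here is illustrative (libraries such as NLTK and spaCy ship much fuller lists):

```python
stop_words = {"a", "an", "the", "of", "on", "at", "to"}   # tiny illustrative list

tokens = "the house at the end of the street".split()
content_tokens = [tok for tok in tokens if tok not in stop_words]
print(content_tokens)   # ['house', 'end', 'street']
```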
Normalizing vocabulary
Tokens that mean similar things are combined into a single, normalized form. Reduces the number of tokens you need to retain in your vocabulary and improves association of meaning across different spellings or n-grams.
Case folding
Consolidating multiple spellings of a word that differ only in capitalization. Helps reduce vocabulary size and generalize the NLP pipeline, but you lose some information in the process: “doctor” and “Doctor” can have different meanings, and lowercasing everything throws away camel-case information as well. A better approach is to lowercase only the first word of each sentence, which preserves the meaning of proper nouns in the middle of a sentence. Many NLP pipelines avoid case folding entirely for this reason. A search engine may be different: you may want a query to match both capitalized and uncapitalized results.
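A minimal sketch contrasting naive case folding with lowercasing only the sentence-initial word; the sentence is illustrative:

```python
tokens = "The house on Main Street was sold in April".split()

# Naive case folding: every token is lowercased, so proper nouns lose their capitals.
folded = [tok.lower() for tok in tokens]

# Gentler normalization: lowercase only the first word of the sentence,
# preserving proper nouns in the middle of it.
gentle = [tokens[0].lower()] + tokens[1:]

print(folded)
print(gentle)
```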
Stemming
Eliminates small differences like pluralization or possessive endings to identify a common stem (“housing,” “houses,” and “house” share the same stem). Reduces the size of the vocabulary while limiting the loss of information and meaning, which helps with dimensionality reduction. Important for search engines, since you want to return documents that contain any word sharing the query's stem.
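A minimal sketch using NLTK's Porter stemmer, one common off-the-shelf stemmer (assumes nltk is installed):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["house", "houses", "housing"]])
# ['hous', 'hous', 'hous']  -- all three collapse to the same stem
```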