Chapter 2 Flashcards

Chapter 2 of Natural Language Processing in Action (Manning)

1
Q

Stemming

A

Process in which you use regular expressions to strip word endings and combine words with similar meanings under a common stem.
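
A minimal sketch of such a regex stemmer, assuming the goal is just to strip a trailing pluralizing "s" (but not "ss") and apostrophes:

    import re

    def stem(phrase):
        # Keep everything up to an optional trailing "s" (unless the word ends in "ss").
        return ' '.join([re.findall('^(.*ss|.*?)(s)?$', word)[0][0].strip("'")
                         for word in phrase.lower().split()])

    print(stem('houses'))  # house
    print(stem('stress'))  # stress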

2
Q

N-grams

A

Sequences of pairs of words (2-grams), triples (3-grams), and so on that occur next to each other in a sentence. Counting them helps retain some of the meaning carried by word order, unlike a plain bag of words (BOW).
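
A quick plain-Python sketch of n-gram extraction (the example sentence is illustrative):

    def ngrams(tokens, n):
        # Slide a window of width n across the token sequence.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = 'Thomas Jefferson began building Monticello'.split()
    print(ngrams(tokens, 2))
    # [('Thomas', 'Jefferson'), ('Jefferson', 'began'),
    #  ('began', 'building'), ('building', 'Monticello')]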

3
Q

Challenges with stemming

A

It is difficult to handle all the variations of inflection ("running", for example), and to discriminate between the pluralizing "s" at the end of "words" and an "s" that is part of the word itself, as in "bus".
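
To see the problem, a hypothetical strip-the-trailing-s rule gets "words" right but mangles "bus":

    import re

    def naive_stem(word):
        # Assume any trailing "s" is a plural ending -- too naive.
        return re.sub(r's$', '', word)

    print(naive_stem('words'))  # word (correct)
    print(naive_stem('bus'))    # bu   (wrong: the "s" is not a plural ending)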

4
Q

Semantic stems

A

Useful clusters of words, such as lemmas or synonyms.

5
Q

Tokenization

A

A kind of document segmentation: breaking text up into smaller chunks or segments with more focused information content (in this case into tokens, instead of paragraphs, sentences, or phrases).
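
The simplest possible tokenizer just splits on whitespace, which shows both the idea and its limits (punctuation stays glued to tokens):

    sentence = 'Thomas Jefferson began building Monticello at the age of 26.'
    tokens = sentence.split()  # whitespace-only tokenization
    print(tokens)
    # ['Thomas', 'Jefferson', 'began', 'building', 'Monticello',
    #  'at', 'the', 'age', 'of', '26.']  <- note the period stuck to '26.'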

6
Q

Scanner or lexer

A

Tokenizer used for compiling computer languages

7
Q

Lexicon

A

Vocabulary for a computer language

8
Q

Terminal

A

The leaves at the ends of the branches of a context-free grammar (CFG) parse tree.

9
Q

One-hot vectors

A

Numerical vector representation for each word in a sentence; each row is the vector for a single word. One-hot vectors are typically super-sparse, containing only one nonzero value each. The representation is like a player piano paper roll: the vocabulary is the key that tells you which note (word) to play for each row in the sequence of words (or piano music).
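
A minimal NumPy sketch of one-hot encoding (sorting the vocabulary is just one way to fix a consistent column order):

    import numpy as np

    tokens = 'Thomas Jefferson began building Monticello'.split()
    vocab = sorted(set(tokens))
    # One row per token in the sentence, one column per vocabulary word.
    onehots = np.zeros((len(tokens), len(vocab)), dtype=int)
    for row, word in enumerate(tokens):
        onehots[row, vocab.index(word)] = 1  # exactly one 1 per row
    print(onehots)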

10
Q

Disadvantage with one-hot vectors

A

Creates a space explosion for long documents: a document of n tokens over a vocabulary of |V| words needs an n × |V| matrix, almost all of it zeros.
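
A back-of-the-envelope illustration (the corpus and vocabulary sizes here are made-up, illustrative numbers):

    num_tokens = 1_000_000   # assumed: a corpus of one million tokens
    vocab_size = 100_000     # assumed: a 100,000-word vocabulary
    bytes_per_cell = 1       # even at a single byte per matrix cell
    print(num_tokens * vocab_size * bytes_per_cell / 1e9, 'GB')  # 100.0 GB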

11
Q

Bag of words

A

Relies on the idea of gleaning the meaning of a sentence from the words it contains rather than from their order or grammar. Compresses the information content of each document into a data structure that is easier to work with (relies on word frequency). Can be indexed to indicate which words were used in which document. Note: it is important to keep the order of the words (the vector's dimensions) consistent across documents.
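
Since a bag of words is essentially a frequency count, Python's Counter gives a quick sketch (the sentence is illustrative):

    from collections import Counter

    sentence = 'the faster Harry got to the store the faster Harry would get home'
    bow = Counter(sentence.split())
    print(bow.most_common(3))
    # [('the', 3), ('faster', 2), ('Harry', 2)]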

12
Q

Bag of Words with dictionary

A

Storing the bag of words as a dictionary saves space: you store only the ones (the words that are present) instead of a tuple of ones and zeros, and the dict can be wrapped in a Pandas Series or DataFrame.
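
A sketch of the idea with Pandas (assuming pandas is available): the dict stores only the words that are present, and missing words are implicit zeros.

    import pandas as pd
    from collections import Counter

    bow = Counter('the faster Harry got to the store'.split())
    series = pd.Series(bow)  # dict-backed: only nonzero counts are stored
    print(series)  # 'the' has count 2; every other word has count 1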

13
Q

Dot Product

A

A way to check for similarity between sentences: the dot product of their word-count vectors counts the number of overlapping tokens. It is the inner product of two vectors (A.T * B, a row vector times a column vector) and is analogous to an inner join on two tables.
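
A small NumPy sketch: with word-presence vectors over a shared (hypothetical) vocabulary, the dot product counts the overlapping words.

    import numpy as np

    # Columns: assumed shared vocabulary ['building', 'jefferson', 'monticello', 'thomas']
    v1 = np.array([1, 1, 1, 1])  # "Thomas Jefferson began building Monticello"
    v2 = np.array([0, 1, 0, 1])  # a sentence mentioning only Jefferson and Thomas
    print(v1.dot(v2))  # 2 -> two overlapping vocabulary words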

14
Q

Cross Product

A

Produces a vector as its output (unlike the dot product, which produces a scalar).

15
Q

How Regex works

A

[] - character class
+ - the match must contain one or more of the characters inside the square brackets
\s - shortcut to a predefined character class (whitespace characters)
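
A quick demonstration of these pieces in a regex tokenizer (the pattern is illustrative):

    import re

    sentence = 'Hello, world!  How are you?'
    # One or more (+) of the characters in the class [...]: whitespace (\s) or punctuation.
    tokens = [t for t in re.split(r'[\s.,;!?]+', sentence) if t]
    print(tokens)  # ['Hello', 'world', 'How', 'are', 'you']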

16
Q

spaCy, Stanford CoreNLP, NLTK

A

Other libraries that implement tokenizers
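
For example, a spaCy tokenizer sketch (assumes spaCy and its small English model en_core_web_sm are installed):

    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp('Thomas Jefferson began building Monticello at the age of 26.')
    print([token.text for token in doc])
    # Punctuation becomes its own token: [..., 'of', '26', '.']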

17
Q

Contractions

A

It is important to split "wasn't" into "was" and "n't" for grammar-based NLP models that use syntax trees, so that "was" and "not" can be handled separately (for example, to represent negation).
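
NLTK's TreebankWordTokenizer performs exactly this split (assumes NLTK is installed):

    from nltk.tokenize import TreebankWordTokenizer

    tokenizer = TreebankWordTokenizer()
    print(tokenizer.tokenize("Monticello wasn't designated a World Heritage Site until 1987."))
    # [..., 'was', "n't", 'designated', ...]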

18
Q

Problems with n-grams

A

Most n-grams are pretty rare, so they don't carry correlations with other words that you can use to help identify topics or themes. In practice, n-grams are filtered out if they occur too infrequently or too often.

19
Q

Stop words

A

Common words in any language that occur at high frequency but carry little information; they are excluded from most NLP tasks. Caution: they might still carry information. On the other hand, retaining them can increase the length of the n-grams you need to use. Including stop words does allow document filters to accurately identify and ignore the words and n-grams with the least information content.
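
A sketch of stop word filtering with NLTK's English stop word list (assumes NLTK and its stopwords corpus are installed):

    from nltk.corpus import stopwords  # first run: nltk.download('stopwords')

    stop_words = set(stopwords.words('english'))
    tokens = 'the faster Harry got to the store'.split()
    print([t for t in tokens if t not in stop_words])
    # ['faster', 'Harry', 'got', 'store']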

20
Q

Normalizing vocabulary

A

Tokens that mean similar things are combined into a single, normalized form. Reduces the number of tokens you need to retain in your vocabulary and improves association of meaning across different spellings or n-grams.

21
Q

Case folding

A

Consolidating multiple spellings of a word that differ only in capitalization. Helps reduce vocabulary size and generalize the NLP pipeline, but you lose some information in the process: for instance, "doctor" and "Doctor" can have two different meanings, and lowercasing everything discards camel-case information as well. A better approach is to lowercase only the first word of each sentence, which preserves the meaning of proper nouns in the middle of a sentence. Many NLP pipelines avoid case folding entirely to preserve proper nouns. The tradeoff may be different for a search engine, where you want queries to return both capitalized and uncapitalized matches.
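
A minimal sketch of that approach (a hypothetical helper, not from the book):

    def fold_case(sentence):
        # Lowercase only the sentence-initial word; mid-sentence capitals survive.
        tokens = sentence.split()
        return [tokens[0].lower()] + tokens[1:] if tokens else tokens

    print(fold_case('The doctor drove to New York'))
    # ['the', 'doctor', 'drove', 'to', 'New', 'York']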

22
Q

Stemming

A

Eliminates small differences from pluralization or possessive endings by identifying a common stem ("housing", "houses", and "house" share the same stem). Reduces the size of the vocabulary while limiting the loss of information and meaning, which helps with dimensionality reduction. Important for search engines, since you want a query to also return documents containing words with a common stem.
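
A sketch using NLTK's Porter stemmer (assumes NLTK is installed); note that stems need not be dictionary words:

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in 'housing houses house'.split()])
    # ['hous', 'hous', 'hous']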