Chapter 2 Flashcards
Chapter 2 of Manning NLP
Stemming
Process of stripping word endings, often with simple regular expressions, so that different forms of a word with similar meaning collapse into a common stem.
N-grams
Sequences of n words (pairs are 2-grams, triples are 3-grams, etc.) that occur in order in a sentence. Counting them helps retain some of the meaning carried by word order, unlike a plain BOW.
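A minimal sketch of building 2-grams from a token list with Python's zip; the sentence is just an illustration:

```python
tokens = "the quick brown fox jumps".split()

# Pair each token with its successor to form 2-grams (bigrams).
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]
```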
Challenges with stemming
Hard to handle all the different inflections of a word (e.g., “running”), and to discriminate between a pluralizing “s” at the end of a word like “words” and an “s” that is simply part of the word, as in “bus”.
Semantic stems
Useful cluster of words like lemmas or synonyms
Tokenization
A kind of document segmentation: breaking text up into smaller chunks (segments) with more focused information content, in this case into tokens instead of paragraphs, sentences, or phrases.
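A minimal sketch of a simple tokenizer that splits on whitespace and punctuation using Python's standard re module; the sentence and pattern are illustrative:

```python
import re

sentence = "Thomas Jefferson began building Monticello at the age of 26."

# Split on runs of whitespace or common punctuation to get word tokens.
tokens = [tok for tok in re.split(r"[-\s.,;!?]+", sentence) if tok]
print(tokens)
# ['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26']
```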
Scanner or lexer
Tokenizer used for compiling computer languages
Lexicon
Vocabulary for a computer language
Terminal
The leaf tokens at the ends of the production rules of a context-free grammar (CFG), i.e., the symbols that cannot be expanded any further.
One-hot vectors
Numerical vector representation of each word in a sentence: one row per word, with a 1 in the column for that word's vocabulary entry and 0 everywhere else. They are typically super-sparse (each row contains only a single nonzero value). Like a player-piano paper roll: the vocabulary key tells you which note (word) to play for each row in the sequence of words or piano music.
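A minimal sketch of one-hot encoding a short sentence with NumPy; the sentence and vocabulary ordering are illustrative:

```python
import numpy as np

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))          # ['cat', 'mat', 'on', 'sat', 'the']

# One row per token, one column per vocabulary word; a single 1 marks the word.
onehot = np.zeros((len(tokens), len(vocab)), dtype=int)
for row, tok in enumerate(tokens):
    onehot[row, vocab.index(tok)] = 1
print(onehot)
```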
Disadvantage with one-hot vectors
Creates a space explosion for long documents: one row per token and one column per vocabulary word, almost all of it zeros.
Bag of words
Relies on the idea of gleaning the meaning of a sentence from which words it contains rather than from their order or grammar (i.e., relies on word frequency). Compresses the information content of each document into a data structure that is easier to work with, and can be indexed to indicate which words were used in which document. Note: keep the ordering of the vocabulary consistent so that vectors from different documents line up.
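A minimal sketch of a bag-of-words count vector over a fixed vocabulary, using only the standard library; the sentence is illustrative:

```python
tokens = "the faster Harry got to the store the faster Harry would get home".split()

# Keep the vocabulary ordering fixed so vectors from different documents line up.
vocab = sorted(set(tokens))
bow = [tokens.count(word) for word in vocab]   # counts in vocabulary order
print(list(zip(vocab, bow)))
```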
Bag of Words with dictionary
Storing the bag of words as a Python dict (or Pandas Series) saves space, because you store only the tokens that are present rather than a tuple of ones and zeros covering the whole vocabulary.
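A minimal sketch of the dictionary form: only the tokens that actually occur are stored, so the zeros for the rest of the vocabulary take no space (collections.Counter is one convenient way to build it):

```python
from collections import Counter

tokens = "the faster Harry got to the store the faster Harry would get home".split()

# Sparse bag of words: keys are the tokens that occur, values are their counts.
bow = Counter(tokens)
print(bow.most_common(3))   # [('the', 3), ('faster', 2), ('Harry', 2)]
```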
Dot Product
A way to check for similarity between two sentences by counting the number of overlapping tokens. It is the inner product of two vectors (analogous to an inner join on two tables): a row vector times a column vector, A.T * B, which yields a single scalar.
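A minimal sketch of measuring token overlap between two bag-of-words vectors with NumPy; the vocabulary and vectors are illustrative:

```python
import numpy as np

vocab = ["harry", "faster", "home", "store", "hairy"]

# Binary bag-of-words vectors built over the same vocabulary ordering.
sent_a = np.array([1, 1, 1, 1, 0])
sent_b = np.array([1, 0, 1, 0, 1])

# The dot product counts how many vocabulary words the two sentences share.
print(sent_a.dot(sent_b))   # 2 ('harry' and 'home' overlap)
```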
Cross Product
Produces a vector as its output (unlike the dot product, which produces a scalar).
How Regex works
[] - character class: the match can be any one of the characters inside the square brackets
+ - the match must contain one or more of the preceding character or character class
\s - shortcut for a predefined character class (whitespace); see the sketch below
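A minimal sketch of these pieces in action with Python's re module; the sentence is illustrative:

```python
import re

sentence = "Monticello wasn't designated as UNESCO World Heritage Site until 1987."

# \s+ : split on one or more whitespace characters
print(re.split(r"\s+", sentence))

# [a-zA-Z0-9]+ : find runs of one or more characters from the bracketed class
print(re.findall(r"[a-zA-Z0-9]+", sentence))
```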
spaCy, Stanford CoreNLP, NLTK
Other libraries that implement tokenizers
Contractions
Important to split "wasn't" into "was" and "n't" so that grammar-based NLP models that use syntax trees can separate "was" and "not" and recognize the negation.
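A minimal sketch using NLTK's Treebank tokenizer, one tokenizer that splits contractions this way (assumes nltk is installed):

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("It wasn't designated a World Heritage Site until 1987."))
# The contraction comes out as separate tokens: ..., 'was', "n't", ...
```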
Problems with n-grams
Most n-grams are pretty rare and don't carry correlations with other words that you could use to identify topics or themes. They are usually filtered out if they occur too infrequently or too often.
Stop words
Common words that occur with high frequency in any language but carry little information content; they are excluded from most NLP tasks. Caution: they can still carry information. On the other hand, retaining them may increase the length of the n-grams you need to use. Including stop words does allow document filters to accurately identify and ignore the words and n-grams with the least information content.
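A minimal sketch of stop-word filtering with a hand-rolled list; the stop-word set here is illustrative (libraries such as NLTK and spaCy ship much fuller lists):

```python
stop_words = {"a", "an", "the", "of", "on", "at", "to"}   # tiny illustrative list

tokens = "the house at the end of the street".split()
content_tokens = [tok for tok in tokens if tok not in stop_words]
print(content_tokens)   # ['house', 'end', 'street']
```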
Normalizing vocabulary
Tokens that mean similar things are combined into a single, normalized form. Reduces the number of tokens you need to retain in your vocabulary and improves association of meaning across different spellings or n-grams.
Case folding
Consolidating multiple spellings of a word that differ only in capitalization. Helps reduce vocabulary size and generalize the NLP pipeline, but you lose some information in the process: “doctor” and “Doctor” can have different meanings, and lowercasing everything throws away camel-case information as well. A better approach is to lowercase only the first word of each sentence, which preserves the meaning of proper nouns in the middle of a sentence. Many NLP pipelines avoid case folding entirely for this reason. A search engine may be different: you may want a query to match both capitalized and uncapitalized results.
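A minimal sketch contrasting naive case folding with lowercasing only the sentence-initial word; the sentence is illustrative:

```python
tokens = "The house on Main Street was sold in April".split()

# Naive case folding: every token is lowercased, so proper nouns lose their capitals.
folded = [tok.lower() for tok in tokens]

# Gentler normalization: lowercase only the first word of the sentence,
# preserving proper nouns in the middle of it.
gentle = [tokens[0].lower()] + tokens[1:]

print(folded)
print(gentle)
```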
Stemming
Eliminates small differences like pluralization or possessive endings to identify a common stem (“housing,” “houses,” and “house” share the same stem). Reduces the size of the vocabulary while limiting the loss of information and meaning, which helps with dimensionality reduction. Important for search engines, since you want to return documents that contain any word sharing the query's stem.
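A minimal sketch using NLTK's Porter stemmer, one common off-the-shelf stemmer (assumes nltk is installed):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["house", "houses", "housing"]])
# ['hous', 'hous', 'hous']  -- all three collapse to the same stem
```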