Chapter 2 Flashcards
Chapter 2 of Manning NLP
Stemming
Process of reducing words with similar meaning to a common stem, typically by using regular expressions or rules to strip suffixes
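A minimal sketch of regex-based stemming, for illustration only. The function name `simple_stem` and its three-suffix rule are hypothetical, not taken from the chapter; real stemmers such as Porter's use far more rules. Note how the crude rule mishandles "running".

```python
import re

def simple_stem(word):
    """Strip one common English suffix using a single regular expression."""
    # Hypothetical rule: drop a trailing "ing", "ed", or "s",
    # but only if at least three characters remain as the stem.
    match = re.match(r"^(.{3,}?)(ing|ed|s)$", word.lower())
    return match.group(1) if match else word.lower()

for word in ["houses", "walked", "walking", "bus", "running"]:
    print(word, "->", simple_stem(word))
# houses -> house
# walked -> walk
# walking -> walk
# bus -> bus
# running -> runn
```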
N-grams
Sequences of n consecutive words in a sentence (pairs are 2-grams, triples are 3-grams, and so on), usually counted. They help retain some of the meaning carried by word order, which a bag of words (BOW) discards.
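A short sketch of extracting and counting n-grams from a tokenized sentence. The helper name `ngrams` and the sample sentence are assumptions made for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
bigram_counts = Counter(bigrams)  # maps each 2-gram to how often it occurs

print(bigrams[:2])   # [('the', 'cat'), ('cat', 'sat')]
print(trigrams[0])   # ('the', 'cat', 'sat')
```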
Challenges with stemming
It is difficult to strip every variation of an inflected word (“running”, for example) and to tell a pluralizing “s” at the end of “words” from an ordinary “s” as in “bus”.
Semantic stems
Useful clusters of words, such as lemmas or synonyms
Tokenization
A kind of document segmentation that breaks text into smaller chunks or segments with more focused information content. In this case the chunks are tokens rather than paragraphs, sentences, or phrases.
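A minimal tokenizer sketch, assuming that splitting on whitespace and a handful of punctuation marks is good enough. The function name `tokenize` and the sample text are hypothetical; this naive split breaks on contractions, decimals, and hyphenated words.

```python
import re

def tokenize(text):
    """Split on runs of whitespace or common punctuation and drop empty strings."""
    return [tok for tok in re.split(r"[\s.,;!?]+", text) if tok]

tokens = tokenize("Tokenizers split text. They return tokens, not sentences!")
print(tokens)
# ['Tokenizers', 'split', 'text', 'They', 'return', 'tokens', 'not', 'sentences']
```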
Scanner or lexer
Tokenizer used for compiling computer languages
Lexicon
Vocabulary for a computer language
Terminal
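The leaves of the parse tree in a context-free grammar (CFG). They are the “end of the line” because no grammar rule expands them further.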
One-hot vectors
Numerical vector representation of each word in a sentence. Stacked into a table, each row is the vector for a single word in the sentence. The vectors are typically super-sparse: each contains exactly one nonzero value (a 1). The table works like a player piano paper roll: the vocabulary key tells which note, or word, to play for each row in the sequence of words or piano music.
Disadvantage with one-hot vectors
Storage explodes for long documents: the table needs one row per token and one column per vocabulary word, and almost every entry is zero.
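A small sketch of one-hot encoding with plain Python lists, to make the row-per-word layout and its size concrete. The sample sentence and the alphabetical vocabulary order are assumptions made for this example.

```python
sentence = "the cat sat on the mat"
tokens = sentence.split()
vocab = sorted(set(tokens))  # fixed column order: ['cat', 'mat', 'on', 'sat', 'the']

# One row per token, one column per vocabulary word, a single 1 per row.
one_hot = [[1 if token == word else 0 for word in vocab] for token in tokens]

for token, row in zip(tokens, one_hot):
    print(f"{token:>4} {row}")

# The table has len(tokens) * len(vocab) cells, here 6 * 5 = 30,
# of which only 6 are nonzero.
```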
Bag of words
Relies on the idea of gleaning the meaning of a sentence based on the words it contains rather than their order and grammar. Compresses the information content of each document into a data structure that is easier to work with (relies on word frequency). Can be indexed to indicate which words were used in which document. Note: it is important to keep the order of the words (the vocabulary) consistent across documents.
Bag of Words with dictionary
Saves space by storing only the words that occur (the ones), keyed by word, instead of every one and zero in a full vector. A Python dictionary or a Pandas Series can hold this.
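A sketch of the dictionary form of a bag of words, using `collections.Counter` from the standard library as a stand-in for the Pandas Series mentioned above. The two sample documents are made up for illustration.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Sparse form: each bag stores only the words that actually occur.
bags = [Counter(doc.split()) for doc in docs]

# Dense form: a shared, consistently ordered vocabulary gives comparable vectors.
vocab = sorted({word for doc in docs for word in doc.split()})
vectors = [[bag[word] for word in vocab] for bag in bags]  # missing words count as 0

print(vocab)       # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 1, 1, 1, 2]
print(vectors[1])  # [0, 1, 0, 0, 1, 1]
```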
Dot Product
Way to check for similarity between sentences: the dot product of two binary bag-of-words vectors counts the number of overlapping tokens. Also called the inner product of two vectors, and analogous to an inner join on two tables. Computed as A.T * B (row vector times column vector).
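A worked sketch of the overlap count, assuming binary bag-of-words vectors over a shared vocabulary. The helper `dot` and the two example vectors are hypothetical.

```python
def dot(u, v):
    """Multiply matching positions and sum the products."""
    return sum(a * b for a, b in zip(u, v))

vocab = ["cat", "dog", "mat", "on", "sat", "the"]
doc_a = [1, 0, 1, 1, 1, 1]  # binary vector for "the cat sat on the mat"
doc_b = [0, 1, 0, 0, 1, 1]  # binary vector for "the dog sat"

overlap = dot(doc_a, doc_b)
print(overlap)  # 2, because only "sat" and "the" appear in both
```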
Cross Product
Produces a vector as its output
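For contrast with the dot product, which returns a single number, here is a sketch of the cross product for 3-D vectors. The helper name `cross` is an assumption; the formula is the standard component form.

```python
def cross(u, v):
    """Cross product of two 3-D vectors, returned as a 3-tuple."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )

x_axis = (1, 0, 0)
y_axis = (0, 1, 0)
print(cross(x_axis, y_axis))  # (0, 0, 1), a vector perpendicular to both inputs
```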
How Regex works
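[] - character class: matches any single one of the characters listed inside the brackets
+ - the preceding element (for example, a character class in square brackets) must match one or more times
\s - shortcut for the predefined whitespace character class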
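The three pieces of syntax above can be tried out with Python's `re` module. The sample strings are made up for this sketch.

```python
import re

# [aeiou] is a character class; + asks for one or more matches in a row.
vowel_runs = re.findall(r"[aeiou]+", "queueing theory")
print(vowel_runs)  # ['ueuei', 'eo']

# \s stands for any whitespace character (space, tab, newline, ...).
words = re.split(r"\s+", "tabs\tand   spaces\nsplit")
print(words)  # ['tabs', 'and', 'spaces', 'split']
```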