Week 7 Flashcards
bag of words
one-hot encoding
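A minimal sketch of the two encodings above (the function names and toy corpus are illustrative, not from the lecture): one-hot gives each word a vector with a single 1, and bag of words sums the one-hot vectors of a document, keeping counts but discarding word order.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct word in the corpus."""
    vocab = {}
    for doc in docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def one_hot(word, vocab):
    """Vocabulary-sized vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def bag_of_words(doc, vocab):
    """Word counts in vocabulary-id order: a sum of one-hot vectors."""
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in sorted(vocab, key=vocab.get)]

docs = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocab(docs)
print(one_hot("cat", vocab))         # [0, 1, 0, 0, 0, 0]
print(bag_of_words(docs[1], vocab))  # [2, 0, 1, 1, 1, 1]
```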
text categorization
multi-class classification with softmax activation, cross-entropy loss, and stochastic gradient descent (see the sketch after the next card)
issues - bag of words is a sparse, high-dimensional representation (one dimension per vocabulary word, almost all zero) and discards word order
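A hedged sketch of the categorization setup above: a single linear layer with softmax over class scores, cross-entropy loss, and one SGD update per example. The shapes, learning rate, and toy data are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sgd_step(W, b, x, y, lr=0.1):
    """One SGD update on a single (bag-of-words x, class index y) example."""
    p = softmax(W @ x + b)       # predicted class probabilities
    loss = -np.log(p[y])         # cross-entropy for the true class
    grad = p.copy()
    grad[y] -= 1.0               # d(loss)/d(logits) = p - one_hot(y)
    W -= lr * np.outer(grad, x)  # backprop into weights and bias
    b -= lr * grad
    return loss

V, C = 6, 3                      # vocabulary size, number of classes
rng = np.random.default_rng(0)
W, b = rng.normal(0, 0.1, (C, V)), np.zeros(C)
x, y = np.array([2, 0, 1, 1, 1, 1], dtype=float), 1
print(sgd_step(W, b, x, y))      # loss shrinks over repeated steps
```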
word2vec
encodings - CBOW (predict the center word from its context window) / skip-gram (predict the context words from the center word)
algorithm - a shallow feed-forward network ("perceptron" with one hidden layer): one-hot context vectors => projected/averaged hidden representation => softmax vector of class probabilities (each class = a vocabulary word: which word fits this context); the trained hidden-layer weights are the word embeddings
optimizations - negative sampling: the softmax denominator sums over the entire vocabulary and is expensive, so replace it by scoring the true word against a limited number of sampled negative words
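A minimal sketch of one skip-gram training step with negative sampling (the dimensions, learning rate, and uniform negative sampler are simplifying assumptions; word2vec itself samples negatives from a smoothed unigram distribution). Instead of normalizing over the whole vocabulary, the true context word and K sampled negatives are scored as K+1 binary classifications.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, K = 1000, 50, 5                     # vocab size, embedding dim, negatives
W_in = rng.normal(0, 0.1, (V, D))         # center-word ("input") embeddings
W_out = rng.normal(0, 0.1, (V, D))        # context-word ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, lr=0.025):
    """One SGD step: push the true pair together, sampled negatives apart."""
    v = W_in[center]
    negatives = rng.integers(0, V, size=K)             # uniform, for simplicity
    targets = np.concatenate(([context], negatives))
    labels = np.zeros(K + 1)
    labels[0] = 1.0                                    # only the true pair is positive
    scores = sigmoid(W_out[targets] @ v)               # K+1 binary classifications
    grad = scores - labels                             # d(loss)/d(score)
    W_in[center] -= lr * grad @ W_out[targets]
    W_out[targets] -= lr * np.outer(grad, v)

sgns_step(center=3, context=17)
```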
text indexing
lexicon + postings
lexicon - list of all terms, each with its postings-list location, document frequency, etc.
hash-based lexicon: O(1) lookup, collisions resolved via collision lists, but updates (rehashing as the vocabulary grows) and prefix/range searches are difficult
B+-tree lexicon: O(log n) lookup, O(log n + k) range search (k = number of matching terms)
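A rough illustration of the two lexicon organizations, using a Python dict as a stand-in for the hash table and a sorted array with binary search as a stand-in for the B+ tree (the entries are made up; real lexicons point into on-disk postings files):

```python
import bisect

entries = {                      # term -> (postings offset, document frequency)
    "cat": (0, 12), "dog": (96, 7), "mat": (160, 3), "sat": (200, 9),
}
hash_lexicon = dict(entries)     # hash-based: constant-time exact lookup
sorted_terms = sorted(entries)   # ordered: supports prefix/range queries

print(hash_lexicon["dog"])       # O(1) exact-match lookup

def range_search(lo, hi):
    """All terms in [lo, hi): binary search to the start, then scan k terms."""
    i = bisect.bisect_left(sorted_terms, lo)
    j = bisect.bisect_left(sorted_terms, hi)
    return sorted_terms[i:j]

print(range_search("c", "e"))    # ['cat', 'dog'] - impossible with a pure hash
```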
postings - for each term, the list of documents in which it appears, plus occurrence counts, term positions, etc.
stored as skip lists (sorted doc-id lists with skip pointers) to speed up lookups and intersections
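A sketch of why skip pointers help: when intersecting two sorted postings lists, a skip pointer lets the scan jump over runs of doc ids that cannot match. The sqrt(length) skip distance is the usual heuristic; the lists below are invented examples.

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted doc-id lists, skipping ahead where possible."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # follow the skip pointer only if its target is still <= b[j]
            if i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

print(intersect_with_skips([2, 4, 8, 16, 32, 64], [1, 2, 3, 8, 64, 128]))  # [2, 8, 64]
```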
document indexing
each document is a list of word identifiers (every word has a numeric id)
need efficient retrieval from the already-parsed text (snippet generation, proximity features)
memory-based inversion = a map with words as keys and lists of documents as values; the entire index must fit in memory
sort-based inversion = fill memory with (term, document) pairs, sort them, write each sorted run out, repeat; then merge the sorted runs into the final postings lists
merge-based inversion = basically the same, but instead of writing and merging sorted runs, each memory-sized batch is inverted into a local/partial index, and the partial indexes are written out and merged => final index (see the sketch after this list)
map-reduce inversion = distributed variant: mappers emit (word, document) pairs from document splits; reducers collect each word's pairs into its postings list
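A simplified, in-memory sketch contrasting memory-based inversion with merge-based inversion (real systems write the partial indexes to disk; the batch size and toy corpus are assumptions for illustration):

```python
from collections import defaultdict

def memory_invert(docs):
    """Memory-based: one pass, word -> sorted doc-id list; whole index in RAM."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for word in set(text.split()):
            index[word].append(doc_id)
    return dict(index)

def merge_invert(docs, batch_size=2):
    """Merge-based: invert each batch into a partial index, then merge partials."""
    partials = []
    for start in range(0, len(docs), batch_size):
        partial = defaultdict(list)            # partial index for one batch
        for doc_id in range(start, min(start + batch_size, len(docs))):
            for word in set(docs[doc_id].split()):
                partial[word].append(doc_id)
        partials.append(partial)               # a real system writes this to disk
    final = defaultdict(list)
    for partial in partials:                   # merge partials in batch order;
        for word, postings in partial.items(): # postings stay sorted because
            final[word].extend(postings)       # batches cover rising doc-id ranges
    return dict(final)

docs = ["the cat sat", "the dog sat", "cat and dog", "on the mat"]
assert merge_invert(docs) == memory_invert(docs)
print(merge_invert(docs)["cat"])  # [0, 2]
```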