Week 7 Flashcards
bag of words
one-hot encoding
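A minimal sketch of the two encodings above (the function names and toy corpus are illustrative, not from the lecture): one-hot gives each word a vector with a single 1, and bag of words sums the one-hot vectors of a document, keeping counts but discarding word order.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct word in the corpus."""
    vocab = {}
    for doc in docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def one_hot(word, vocab):
    """Vocabulary-sized vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def bag_of_words(doc, vocab):
    """Word counts in vocabulary-id order: a sum of one-hot vectors."""
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in sorted(vocab, key=vocab.get)]

docs = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocab(docs)
print(one_hot("cat", vocab))         # [0, 1, 0, 0, 0, 0]
print(bag_of_words(docs[1], vocab))  # [2, 0, 1, 1, 1, 1]
```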
text categorization
multi-class classification with softmax activation, cross-entropy loss, and stochastic gradient descent (see the sketch after the next card)
issues - bag of words is a sparse, high-dimensional representation (one dimension per vocabulary word, almost all zero) and discards word order
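A hedged sketch of the categorization setup above: a single linear layer with softmax over class scores, cross-entropy loss, and one SGD update per example. The shapes, learning rate, and toy data are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sgd_step(W, b, x, y, lr=0.1):
    """One SGD update on a single (bag-of-words x, class index y) example."""
    p = softmax(W @ x + b)       # predicted class probabilities
    loss = -np.log(p[y])         # cross-entropy for the true class
    grad = p.copy()
    grad[y] -= 1.0               # d(loss)/d(logits) = p - one_hot(y)
    W -= lr * np.outer(grad, x)  # backprop into weights and bias
    b -= lr * grad
    return loss

V, C = 6, 3                      # vocabulary size, number of classes
rng = np.random.default_rng(0)
W, b = rng.normal(0, 0.1, (C, V)), np.zeros(C)
x, y = np.array([2, 0, 1, 1, 1, 1], dtype=float), 1
print(sgd_step(W, b, x, y))      # loss shrinks over repeated steps
```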
word2vec
encodings - CBOW (predict the center word from its context window) / skip-gram (predict the context words from the center word)
algorithm - a shallow feed-forward network ("perceptron" with one hidden layer): one-hot context vectors => projected/averaged hidden representation => softmax vector of class probabilities (each class = a vocabulary word: which word fits this context); the trained hidden-layer weights are the word embeddings
optimizations - negative sampling: the softmax denominator sums over the entire vocabulary and is expensive, so replace it by scoring the true word against a limited number of sampled negative words
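A minimal sketch of one skip-gram training step with negative sampling (the dimensions, learning rate, and uniform negative sampler are simplifying assumptions; word2vec itself samples negatives from a smoothed unigram distribution). Instead of normalizing over the whole vocabulary, the true context word and K sampled negatives are scored as K+1 binary classifications.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, K = 1000, 50, 5                     # vocab size, embedding dim, negatives
W_in = rng.normal(0, 0.1, (V, D))         # center-word ("input") embeddings
W_out = rng.normal(0, 0.1, (V, D))        # context-word ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, lr=0.025):
    """One SGD step: push the true pair together, sampled negatives apart."""
    v = W_in[center]
    negatives = rng.integers(0, V, size=K)             # uniform, for simplicity
    targets = np.concatenate(([context], negatives))
    labels = np.zeros(K + 1)
    labels[0] = 1.0                                    # only the true pair is positive
    scores = sigmoid(W_out[targets] @ v)               # K+1 binary classifications
    grad = scores - labels                             # d(loss)/d(score)
    W_in[center] -= lr * grad @ W_out[targets]
    W_out[targets] -= lr * np.outer(grad, v)

sgns_step(center=3, context=17)
```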
text indexing
lexicon + postings
lexicon - list of all terms, each with its postings-list location, document frequency, etc.
hash-based lexicon: O(1) lookup, collisions resolved via collision lists, but updates (rehashing as the vocabulary grows) and prefix/range searches are difficult
B+-tree lexicon: O(log n) lookup, O(log n + k) range search (k = number of matching terms)
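A rough illustration of the two lexicon organizations, using a Python dict as a stand-in for the hash table and a sorted array with binary search as a stand-in for the B+ tree (the entries are made up; real lexicons point into on-disk postings files):

```python
import bisect

entries = {                      # term -> (postings offset, document frequency)
    "cat": (0, 12), "dog": (96, 7), "mat": (160, 3), "sat": (200, 9),
}
hash_lexicon = dict(entries)     # hash-based: constant-time exact lookup
sorted_terms = sorted(entries)   # ordered: supports prefix/range queries

print(hash_lexicon["dog"])       # O(1) exact-match lookup

def range_search(lo, hi):
    """All terms in [lo, hi): binary search to the start, then scan k terms."""
    i = bisect.bisect_left(sorted_terms, lo)
    j = bisect.bisect_left(sorted_terms, hi)
    return sorted_terms[i:j]

print(range_search("c", "e"))    # ['cat', 'dog'] - impossible with a pure hash
```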
postings - for each term, the list of documents in which it appears, plus occurrence counts, term positions, etc.
stored as skip lists (sorted doc-id lists with skip pointers) to speed up lookups and intersections
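A sketch of why skip pointers help: when intersecting two sorted postings lists, a skip pointer lets the scan jump over runs of doc ids that cannot match. The sqrt(length) skip distance is the usual heuristic; the lists below are invented examples.

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted doc-id lists, skipping ahead where possible."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # follow the skip pointer only if its target is still <= b[j]
            if i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

print(intersect_with_skips([2, 4, 8, 16, 32, 64], [1, 2, 3, 8, 64, 128]))  # [2, 8, 64]
```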
document indexing
each document is a list of word identifiers (every word has a numeric id)
need efficient retrieval from the already-parsed text (snippet generation, proximity features)
memory-based inversion = a map with words as keys and lists of documents as values; the entire index must fit in memory
sort-based inversion = fill memory with (term, document) pairs, sort them, write each sorted run out, repeat; then merge the sorted runs into the final postings lists
merge-based inversion = basically the same, but instead of writing and merging sorted runs, each memory-sized batch is inverted into a local/partial index, and the partial indexes are written out and merged => final index (see the sketch after this list)
map-reduce inversion = distributed variant: mappers emit (word, document) pairs from document splits; reducers collect each word's pairs into its postings list
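A simplified, in-memory sketch contrasting memory-based inversion with merge-based inversion (real systems write the partial indexes to disk; the batch size and toy corpus are assumptions for illustration):

```python
from collections import defaultdict

def memory_invert(docs):
    """Memory-based: one pass, word -> sorted doc-id list; whole index in RAM."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for word in set(text.split()):
            index[word].append(doc_id)
    return dict(index)

def merge_invert(docs, batch_size=2):
    """Merge-based: invert each batch into a partial index, then merge partials."""
    partials = []
    for start in range(0, len(docs), batch_size):
        partial = defaultdict(list)            # partial index for one batch
        for doc_id in range(start, min(start + batch_size, len(docs))):
            for word in set(docs[doc_id].split()):
                partial[word].append(doc_id)
        partials.append(partial)               # a real system writes this to disk
    final = defaultdict(list)
    for partial in partials:                   # merge partials in batch order;
        for word, postings in partial.items(): # postings stay sorted because
            final[word].extend(postings)       # batches cover rising doc-id ranges
    return dict(final)

docs = ["the cat sat", "the dog sat", "cat and dog", "on the mat"]
assert merge_invert(docs) == memory_invert(docs)
print(merge_invert(docs)["cat"])  # [0, 2]
```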