week 7 Flashcards

1
Q

bag of words

A

represent each word as a one-hot vector over the vocabulary; a document is then the sum of the one-hot vectors of its words, i.e. a vector of word counts (word order is discarded)
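A minimal sketch of the idea in Python (the vocabulary and example sentence below are made up for illustration):

from collections import Counter

# toy vocabulary; each word's index is the position of its 1 in the one-hot vector
vocab = ["cat", "dog", "sat", "mat", "the"]
index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(tokens):
    # sum of one-hot vectors = vector of word counts; order is discarded
    vec = [0] * len(vocab)
    for word, count in Counter(tokens).items():
        if word in index:            # out-of-vocabulary words are simply dropped here
            vec[index[word]] = count
    return vec

print(bag_of_words("the cat sat on the mat".split()))   # [1, 0, 1, 1, 2]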

2
Q

text categorization

A

multi-class classification with softmax output activation, cross-entropy loss and stochastic gradient descent
issue - the bag-of-words input is a sparse, high-dimensional representation
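A rough sketch of such a classifier in numpy (the documents, labels and hyperparameters are made up; full-batch gradient descent is used instead of true SGD to keep it short):

import numpy as np

rng = np.random.default_rng(0)

# toy data: 4 documents as bag-of-words count vectors over a 5-word vocabulary, 2 classes
X = np.array([[2, 1, 0, 0, 1],
              [1, 0, 0, 1, 2],
              [0, 0, 3, 1, 0],
              [0, 1, 2, 2, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

W = rng.normal(scale=0.01, size=(5, 2))    # one weight column per class
b = np.zeros(2)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for step in range(200):
    p = softmax(X @ W + b)                               # class probabilities
    onehot = np.eye(2)[y]
    loss = -np.mean(np.sum(onehot * np.log(p), axis=1))  # cross-entropy loss
    grad = (p - onehot) / len(X)                         # gradient w.r.t. the logits
    W -= 0.5 * X.T @ grad
    b -= 0.5 * grad.sum(axis=0)

print(loss, softmax(X @ W + b).argmax(axis=1))           # predictions should match y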

3
Q

word2vec

A

encoding - CBOW (predict the center word from its context) / skip-gram (predict the context words from the center word)
algorithm - shallow feed-forward network with 1 hidden layer : one-hot context words => hidden representation (the word embeddings) => softmax vector of class probabilities (class = which word fits this context)
optimization - negative sampling - the softmax denominator sums over the entire vocabulary and is expensive => replace it with a binary objective over the true (word, context) pair plus a limited number of sampled negative pairs
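A toy skip-gram-with-negative-sampling sketch in numpy (the corpus, embedding size and hyperparameters are made up; real word2vec draws negatives from a unigram^0.75 distribution and adds further tricks such as subsampling):

import numpy as np

rng = np.random.default_rng(0)

corpus = "the cat sat on the mat the dog sat on the mat".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                          # vocabulary size, embedding dimension

W_in = rng.normal(scale=0.1, size=(V, D))     # center-word ("input") vectors
W_out = rng.normal(scale=0.1, size=(V, D))    # context-word ("output") vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr, window, k = 0.05, 2, 3                    # learning rate, context window, negatives per pair
for epoch in range(50):
    for pos, word in enumerate(corpus):
        center = idx[word]
        for off in range(-window, window + 1):
            if off == 0 or not 0 <= pos + off < len(corpus):
                continue
            context = idx[corpus[pos + off]]
            negatives = rng.integers(0, V, size=k)    # uniform negatives, for simplicity
            for target, label in zip([context, *negatives], [1.0] + [0.0] * k):
                score = sigmoid(W_in[center] @ W_out[target])
                err = score - label                   # logistic-loss gradient
                grad_center = err * W_out[target]
                W_out[target] -= lr * err * W_in[center]
                W_in[center] -= lr * grad_center

# after training, the rows of W_in are the learned word embeddings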

4
Q

text indexing

A

lexicon + postings
lexicon - list of all terms, each with the location of its postings list, its frequency, etc.
hash-based lexicon: O(1) lookup, collisions resolved with collision lists, updates are difficult
B+-tree lexicon: O(log n) lookup, O(log n + k) range search
postings - for each term, the list of documents it appears in, with occurrence counts, term positions, etc.
often stored as skip lists to speed up postings-list intersection
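A minimal positional inverted index in Python, using a plain dict where the card's hash or B+-tree lexicon would sit (the two documents are made up; skip lists and on-disk layout are omitted):

from collections import defaultdict

docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
}

lexicon = defaultdict(int)      # term -> document frequency (a real lexicon also stores the postings location)
postings = defaultdict(list)    # term -> [(doc_id, [positions]), ...]

for doc_id, text in docs.items():
    positions = defaultdict(list)
    for pos, term in enumerate(text.split()):
        positions[term].append(pos)
    for term, pos_list in positions.items():
        lexicon[term] += 1
        postings[term].append((doc_id, pos_list))

print(lexicon["cat"])     # 2  (two documents contain "cat")
print(postings["cat"])    # [(1, [1]), (2, [4])]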

5
Q

document indexing

A

each document is stored as a list of words (every word has an identifier)
need efficient retrieval of the already parsed text (snippet generation, proximity features)
memory-based inversion = build a map with words as keys and lists of documents as values, entirely in memory
sort-based inversion = sort (term, document) records in runs of limited size, write each sorted run to disk, then merge the sorted runs into the final postings lists; needs large memory
merge-based inversion = basically the same, but instead of writing and merging sorted runs, it writes and merges local/partial indexes => final index
map-reduce inversion = distribute the inversion over many machines: the map phase emits (term, document) pairs, the reduce phase collects each term's postings list
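A toy sketch of merge-based inversion in Python (the document stream and the batch size of 2 are made up; the partial indexes are kept in lists here instead of being written to disk):

from collections import defaultdict
import heapq

# each document is (doc_id, list of term identifiers)
docs = [(1, ["a", "b", "a"]), (2, ["b", "c"]), (3, ["a", "c"]), (4, ["c", "d"])]

def invert(batch):
    # memory-based inversion of one batch: term -> sorted list of doc ids
    index = defaultdict(list)
    for doc_id, terms in batch:
        for term in set(terms):
            index[term].append(doc_id)
    return sorted(index.items())

# build a partial index per fixed-size batch, then merge the partial indexes into the final index
partial_indexes = [invert(docs[i:i + 2]) for i in range(0, len(docs), 2)]

final = defaultdict(list)
for term, doc_ids in heapq.merge(*partial_indexes):
    final[term].extend(doc_ids)

print(dict(final))   # {'a': [1, 3], 'b': [1, 2], 'c': [2, 3, 4], 'd': [4]}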
