C3 Flashcards
vector space model
the dimensions are words and the documents are vectors: each document vector shows which words that document contains
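A minimal sketch of the idea on a made-up toy corpus: each word in the vocabulary is one dimension, and each document becomes a (here binary) vector over those dimensions.

```python
# Toy term-document matrix: rows are documents, columns (dimensions) are words.
docs = ["the bike is fast", "the bicycle is red", "the car is fast"]

vocab = sorted({w for d in docs for w in d.split()})                 # the dimensions
matrix = [[1 if w in d.split() else 0 for w in vocab] for d in docs] # one vector per document

for d, row in zip(docs, matrix):
    print(f"{d!r:25} -> {row}")
print("dimensions:", vocab)
```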
2 problems vector space model
synonymy: many ways to refer to the same object (bike and bicycle)
polysemy: many words have more than one distinct meaning
word embeddings model
represent words in a continuous vector space
- relatively low dimensional vector space
- semantically and syntactically similar words are mapped to nearby points (distributional hypothesis)
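A small illustration of the "nearby points" idea with invented low-dimensional vectors (the numbers are made up for the example, not real embeddings): semantically similar words get a high cosine similarity.

```python
import math

# Invented 3-dimensional "embeddings", purely for illustration.
emb = {
    "bike":    [0.90, 0.10, 0.00],
    "bicycle": [0.85, 0.15, 0.05],
    "banana":  [0.00, 0.20, 0.95],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(emb["bike"], emb["bicycle"]))  # high: nearby points
print(cosine(emb["bike"], emb["banana"]))   # low: far apart
```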
feedforward network
multilayer network in which the units are connected with no cycles
- three kinds of nodes: input, hidden, output
- each layer is fully connected to the next (in the standard architecture)
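A minimal forward pass for this standard architecture (one hidden layer, fully connected, no cycles); the weights are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes: 4 input units, 3 hidden units, 2 output units.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # input -> hidden
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)   # hidden -> output

def forward(x):
    h = np.tanh(W1 @ x + b1)   # hidden layer with nonlinearity
    return W2 @ h + b2         # output layer (raw scores)

print(forward(np.array([1.0, 0.0, 2.0, -1.0])))
```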
classification with feedforward network
binary: single output node
multi-class: one output node per category; the output layer gives a probability distribution
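Sketch of the two output variants with made-up scores: a sigmoid on a single node for binary classification, a softmax over one node per category for multi-class.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

print(sigmoid(0.7))                          # binary: P(class = 1) from one output node
print(softmax(np.array([2.0, 0.5, -1.0])))   # multi-class: distribution over 3 categories
```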
word2vec
train a neural classifier on a binary prediction task: is word w likely to show up near the word bicycle? => Take the learned classifier weights on the hidden layer as the word embeddings
computationally efficient predictive model for learning word embeddings from raw text
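The binary classifier amounts to a sigmoid of the dot product between the target and context vectors; a sketch with invented vectors:

```python
import math

def p_positive(w_vec, c_vec):
    """P(+ | w, c): probability that c is a real context word of w."""
    dot = sum(a * b for a, b in zip(w_vec, c_vec))
    return 1.0 / (1.0 + math.exp(-dot))

# Invented vectors, purely for illustration.
print(p_positive([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]))   # similar vectors -> high probability
print(p_positive([0.9, 0.1, 0.0], [-0.7, 0.0, 0.9]))  # dissimilar vectors -> low probability
```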
training word2vec
supervised learning problem on unlabeled data: self-supervision
- treat target word and a neighbouring context word as positive examples
- randomly sample other words in the lexicon to get negative samples
- train a classifier to distinguish those two cases
- learned weights are the embeddings
- maximize similarity of (target word, context word) pairs drawn from positive examples
- minimize similarity of (w,c_neg) pairs from negative examples
- each (target word, context word) pair is one classification example that determines whether and how the current vectors get adjusted
=> weights get updated while the model processes the collection (minimize and maximize dot-products)
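A bare-bones sketch of this training procedure (skip-gram with negative sampling) on a toy corpus; tiny dimensions, a fixed learning rate, and uniform negative sampling are simplifications made for readability.

```python
import math
import random

random.seed(0)

corpus = "the bike is fast the bicycle is fast the car is slow".split()
vocab = sorted(set(corpus))
dim, lr, window, k_neg = 5, 0.05, 2, 2

# Two vectors per word: one as target word (W), one as context word (C).
W = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
C = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update(target, context, label):
    """One gradient step pushing sigmoid(C[context] . W[target]) towards label (1 or 0)."""
    w_vec, c_vec = W[target], C[context]
    dot = sum(a * b for a, b in zip(w_vec, c_vec))
    g = lr * (label - sigmoid(dot))
    for i in range(dim):
        w_vec[i], c_vec[i] = w_vec[i] + g * c_vec[i], c_vec[i] + g * w_vec[i]

for epoch in range(50):
    for i, target in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if i == j:
                continue
            update(target, corpus[j], 1)           # positive example: real neighbour
            for _ in range(k_neg):                 # negative samples (uniform; may hit a real
                update(target, random.choice(vocab), 0)  # context word, ignored here for brevity)

# The learned target vectors W are the word embeddings.
print(W["bike"])
```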
3 advantages of word2vec
- it scales: can be trained on billion-word corpora in limited time, and training can be parallelized
- pre-trained word embeddings trained by one can be used by others
- incremental training: train one piece, save results, continue later
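As an example of reusing embeddings trained by others: the gensim library (assumed to be installed) ships a downloader for several published pre-trained vector sets, e.g. the GloVe model named below.

```python
import gensim.downloader as api

# Downloads (once) and loads a small set of pre-trained GloVe vectors.
wv = api.load("glove-wiki-gigaword-50")

print(wv.most_similar("bicycle", topn=5))   # nearest neighbours in embedding space
print(wv.similarity("bike", "bicycle"))     # cosine similarity between two words
```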
4 text mining tasks that benefit from using word2vec
- synonym detection
- richer word and context representation for named entity recognition
- document similarity
- finding word associations/clusters
tf-idf
w_t,d = tf_t,d * idf_t
tf = term frequency of word t in document d = 1 + log_10(count(t,d)) if count(t,d) > 0, else 0
idf = inverse document frequency = log_10(N/df_t) with N the total number of documents and df_t the number of documents t occurs in
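A worked example of the formula on a hypothetical three-document collection:

```python
import math

docs = [
    "the bike is fast".split(),
    "the bicycle is red".split(),
    "the car is fast".split(),
]
N = len(docs)

def tf(t, doc):
    count = doc.count(t)
    return 1 + math.log10(count) if count > 0 else 0.0

def idf(t):
    df = sum(1 for doc in docs if t in doc)   # number of documents containing t
    return math.log10(N / df) if df > 0 else 0.0

def tf_idf(t, doc):
    return tf(t, doc) * idf(t)

print(tf_idf("bike", docs[0]))   # rare term -> higher weight
print(tf_idf("the", docs[0]))    # occurs in every document -> idf = 0, weight = 0
```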
PMI
Pointwise Mutual Information
PMI(w,c) = log_2( P(w,c) / (P(w) * P(c)) )
estimate of how much more the two words co-occur than we expect by chance
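A sketch of estimating PMI from co-occurrence counts over an invented set of (word, context) pairs:

```python
import math
from collections import Counter

# Invented (word, context) co-occurrence pairs, purely for illustration.
pairs = ([("wrote", "book")] * 8 + [("wrote", "car")] * 1 +
         [("drove", "car")] * 7 + [("drove", "book")] * 1)

pair_counts = Counter(pairs)
w_counts = Counter(w for w, _ in pairs)
c_counts = Counter(c for _, c in pairs)
total = len(pairs)

def pmi(w, c):
    p_wc = pair_counts[(w, c)] / total
    p_w, p_c = w_counts[w] / total, c_counts[c] / total
    return math.log2(p_wc / (p_w * p_c))

print(pmi("wrote", "book"))   # > 0: co-occur more often than chance
print(pmi("wrote", "car"))    # < 0: co-occur less often than chance
```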
first-order co-occurrence
when two words are typically nearby each other (“wrote” and “book”)
second-order co-occurrence
when two words have similar neighbours (“wrote” and “said”)
distributional hypothesis
words that occur in similar contexts tend to be similar (the context of a word defines its meaning)
proposed neural architectures for computing word vectors
- word2vec
- GloVe
- FastText
- ELMo
- BERT