Word Embeddings Flashcards
What are the problems with resources like WordNet?
- Great as a resource but missing nuance
- e.g. “proficient” is listed as a synonym for “good”.
- This is only correct in some contexts.
- Missing new meanings of words
- e.g., wicked, badass, nifty, wizard, genius, ninja, bombest
- Impossible to keep up-to-date!
- Subjective
- Requires human labor to create and adapt
- Can’t compute accurate word similarity
What are the problems with one-hot vectors?
• Example: in web search, if user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”.
• But
Motel = [0,0,0,1,0,…,0,0]
Hotel = [0,1,0,0,0,…,0,0]
• These two vectors are orthogonal → every pair of one-hot word vectors is orthogonal, and hence every pair of words has the same “distance” to each other
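A minimal sketch (the toy vocabulary and use of NumPy are my own choices, not from the notes) showing the problem: every pair of distinct one-hot vectors has dot product 0 and the same distance, so “motel” is no closer to “hotel” than to any other word.

```python
import numpy as np

# Toy vocabulary; the words and their indices are illustrative assumptions.
vocab = ["seattle", "motel", "hotel", "the"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

motel, hotel = one_hot("motel"), one_hot("hotel")
print(np.dot(motel, hotel))           # 0.0 -> orthogonal, no similarity signal
print(np.linalg.norm(motel - hotel))  # sqrt(2), identical for every distinct pair
```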
How do you represent the meaning of a word?
• Distributional semantics (aka the distributional hypothesis): a word’s meaning is given by the words that frequently appear close by.
“You shall know a word by the company it keeps” (J. R. Firth)
“Words are similar if they occur in similar contexts” (Harris)
- One of the most successful ideas of modern statistical NLP!
- When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window)
What could a word vector solution based on distributional semantics be?
Solution 1: count co-occurrence of words (co-occurrence matrix)
• Capture in which contexts a word appears
• Context is modelled using a window over the words
Give the details of the co-occurrence matrix approach.
- Assumption: If we collect co-occurrence counts over thousands of sentences, the vectors for “enjoy” and “like” will have similar vector representations.
- Like the document-term matrix used in Information Retrieval
- Doc-term matrix: represent a document by the words that appear in it
- Word co-occurrence matrix: represent a word by its context (i.e. surrounding words)
- Problem:
- Vectors become very large with real data
- Workaround: apply dimensionality reduction (truncated SVD)
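A small sketch of building such a co-occurrence matrix with a symmetric window (the toy corpus, whitespace tokenisation, and window size 2 are assumptions for illustration):

```python
import numpy as np

corpus = ["i enjoy natural language processing", "i like deep learning"]  # toy corpus
window = 2

# Build the vocabulary and assign each word an index.
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count how often each pair of words co-occurs within the window.
M = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                M[idx[w], idx[sent[j]]] += 1

# The row for a word is its (sparse, high-dimensional) context-count vector.
print(M[idx["enjoy"]])
```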
Problems with SVD + co-occurrence matrix?
• Instead of using the high-dimensional original co-occurrence matrix M, use U_t, the first t columns of U (the dimension t is given by the user)
Cons:
• High computational cost for large datasets
• Cannot be dynamically updated with new words
• Didn’t work too well in practice
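A hedged sketch of the SVD workaround using NumPy (the placeholder matrix and the choice t = 2 are assumptions; with real data M would be the co-occurrence counts built as in the earlier sketch):

```python
import numpy as np

# Placeholder co-occurrence matrix so the example runs; in practice use real counts.
M = np.random.rand(8, 8)

t = 2  # target dimensionality, chosen by the user
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Keep the first t singular directions: each row of U_t (here scaled by the
# singular values) becomes a t-dimensional word vector.
word_vectors = U[:, :t] * S[:t]
print(word_vectors.shape)  # (8, 2)
```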
How do you learn low-dimensional word vectors?
Target:
• Build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts
• Dense: in contrast to co-occurrence-matrix word vectors, which are sparse (high dimensional)
• Word vectors are sometimes called word embeddings or word representations. They are a distributed representation.
What is word2vec and what is the idea of word2vec?
- Input: a large corpus of text
- Output: every word in a fixed vocabulary is represented by a vector
- Major idea:
- For words appearing in the same context, their vectors should be close
- For words not appearing in the same context, their vectors should be far away
Check slide 11 pg 18 - 22
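A sketch of training word2vec in practice with the gensim library (parameter names assume the gensim 4.x API; the two-sentence corpus is purely illustrative, so the resulting vectors are meaningless):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences.
sentences = [
    ["i", "enjoy", "natural", "language", "processing"],
    ["i", "like", "deep", "learning"],
]

# sg=1 selects skip-gram (sg=0 would be CBOW); negative=5 enables negative sampling.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)

vec = model.wv["enjoy"]                # the learned vector for "enjoy"
print(model.wv.most_similar("enjoy"))  # nearest words by cosine similarity
```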
What do you do in negative sampling?
Down-sample the non-contextual words, so as to make sure the numbers of contextual and non-contextual words in O are the same.
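A rough sketch of that idea: for each observed context window, draw as many non-contextual (negative) words as there are contextual ones. The uniform sampling here is a simplification; word2vec actually samples negatives from a smoothed unigram distribution.

```python
import random

vocab = ["i", "enjoy", "natural", "language", "processing", "like", "deep", "learning"]

def negative_samples(exclude, k):
    """Draw k words that are neither the centre word nor in the context window."""
    candidates = [w for w in vocab if w not in exclude]
    return random.sample(candidates, k)

centre = "enjoy"
context = ["i", "natural", "language"]  # positive (contextual) words
negatives = negative_samples(set(context) | {centre}, k=len(context))
print(negatives)  # as many negatives as positives
```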
Give details about the dot product in word2vec.
- Because all embedding values are randomly drawn from the same distribution, the embedding norms are roughly the same
- Hence, to speed up computation, we ignore the denominator of the cosine similarity (the product of the norms) and measure similarity by the dot product of the embeddings
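A short sketch contrasting the full cosine similarity with the dot-product shortcut (the random vectors are placeholders): when the norms are roughly equal, the denominator is nearly constant, so dropping it barely changes the ranking but saves computation.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=50), rng.normal(size=50)

dot = np.dot(u, v)
cosine = dot / (np.linalg.norm(u) * np.linalg.norm(v))  # full formula with denominator

# With roughly equal norms across the vocabulary, ranking by dot product is
# close to ranking by cosine similarity, at lower cost.
print(dot, cosine)
```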
What are word2vec’s two approaches?
CBOW (continuous bag-of-word): Given a context, predict the missing word
• same procedure ____ every year
• as long ___ you sing
• please stay ___ you are
Skip-gram: given a word, predict the context words
• ____ ______ as _____ ______
• If window size is two, we aim to predict:
• (w, c-2), (w, c-1), (w, c+1) and (w, c+2)
Note that order information is not preserved, i.e. we do not distinguish whether a context word occurs before or after the centre word.
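A sketch of generating skip-gram training pairs for a toy sentence with window size two; note that the pairs carry no information about whether the context word came before or after the centre word.

```python
def skipgram_pairs(sentence, window=2):
    """Yield (centre, context) pairs within a symmetric window."""
    for i, w in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                yield (w, sentence[j])

sentence = ["same", "procedure", "as", "every", "year"]
print(list(skipgram_pairs(sentence)))
# For centre "as": ("as", "same"), ("as", "procedure"), ("as", "every"), ("as", "year")
```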
Name embedding models (at least two).
Word2vec
GloVe
There are many other word embedding models; these two are (arguably) the most popular.
Bengio’s NNLM vs word2vec
- Word2vec is “simpler”
- It has no “hidden” layer
- Simpler models can be trained and run faster
- It can train on billions of word tokens in hours/days
• Uni-directional vs bi-directional
• NNLM (neural network based language model) predicts the next word using the preceding words (left-to-right prediction)
• Word2vec predicts both preceding and succeeding words (or, in CBOW, uses the context words to predict the centre word), hence it makes bi-directional predictions
Why does word2vec work?
• Not all neural embeddings are good
• Mikolov et al. (2013) survey four models (Recurrent, MLP NNLM, CBOW, Skip-Gram)
• Some of them work quite poorly (under their parametrizations)
• Skip-Gram and CBOW are the “simplest” possible models
• Can run on much more data
• And are much faster than predecessor neural models
• “Tricks” play an important role (negative sampling)
• It took a long stream of experimentation to make neural net language models successful (~10 years)
Why do “similar” words have similar vectors under skip-gram or CBOW?
- Assume two words have similar contexts (e.g. love/like)
- Word2vec training pushes words that appear in the same context window to have similar vectors (i.e. a high dot product)
- Since both words are pushed towards the vectors of the same context words, they end up close to each other