Word Embeddings Flashcards
What are the problems with resources like WordNet?
- Great as a resource but missing nuance
- e.g. “proficient” is listed as a synonym for “good”.
- This is only correct in some contexts.
- Missing new meanings of words
- e.g., wicked, badass, nifty, wizard, genius, ninja, bombest
- Impossible to keep up-to-date!
- Subjective
- Requires human labor to create and adapt
- Can’t compute accurate word similarity
What are the problems with one-hot vectors?
• Example: in web search, if user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”.
• But
Motel = [0,0,0,1,0,…,0,0]
Hotel = [0,1,0,0,0,…,0,0]
• These two vectors are orthogonal → every pair of one-hot word vectors is orthogonal, and hence every pair of words has the same “distance” to each other
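A minimal sketch (the toy vocabulary and use of NumPy are my own choices, not from the notes) showing the problem: every pair of distinct one-hot vectors has dot product 0 and the same distance, so “motel” is no closer to “hotel” than to any other word.

```python
import numpy as np

# Toy vocabulary; the words and their indices are illustrative assumptions.
vocab = ["seattle", "motel", "hotel", "the"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

motel, hotel = one_hot("motel"), one_hot("hotel")
print(np.dot(motel, hotel))           # 0.0 -> orthogonal, no similarity signal
print(np.linalg.norm(motel - hotel))  # sqrt(2), identical for every distinct pair
```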
How do you represent the meaning of a word?
• Distributional semantics (aka the distributional hypothesis): a word’s meaning is given by the words that frequently appear close by.
“You shall know a word by the company it keeps” (J. R. Firth)
“Words are similar if they occur in similar contexts” (Harris)
- One of the most successful ideas of modern statistical NLP!
- When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window)
What could a word vector solution based on distributional semantics be?
Solution 1: count co-occurrence of words (co-occurrence matrix)
• Capture in which contexts a word appears
• Context is modelled using a window over the words
Give the details of the co-occurrence matrix approach.
- Assumption: If we collect co-occurrence counts over thousands of sentences, the vectors for “enjoy” and “like” will have similar vector representations.
- Like the document-term matrix used in Information Retrieval
- Doc-term matrix: represent a document by the words that appear in it
- Word co-occurrence matrix: represent a word by its context (i.e. surrounding words)
- Problem:
- Vectors become very large with real data
- Workaround: apply dimensionality reduction (truncated SVD)
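A small sketch of building such a co-occurrence matrix with a symmetric window (the toy corpus, whitespace tokenisation, and window size 2 are assumptions for illustration):

```python
import numpy as np

corpus = ["i enjoy natural language processing", "i like deep learning"]  # toy corpus
window = 2

# Build the vocabulary and assign each word an index.
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count how often each pair of words co-occurs within the window.
M = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                M[idx[w], idx[sent[j]]] += 1

# The row for a word is its (sparse, high-dimensional) context-count vector.
print(M[idx["enjoy"]])
```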
Problems with SVD + co-occurrence matrix?
• Instead of using the high-dimensional original co-occurrence matrix M, use U_t, the first t columns of U (the dimension t is given by the user)
Cons:
• High computational cost for large datasets
• Cannot be dynamically updated with new words
• Didn’t work too well in practice
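A hedged sketch of the SVD workaround using NumPy (the placeholder matrix and the choice t = 2 are assumptions; with real data M would be the co-occurrence counts built as in the earlier sketch):

```python
import numpy as np

# Placeholder co-occurrence matrix so the example runs; in practice use real counts.
M = np.random.rand(8, 8)

t = 2  # target dimensionality, chosen by the user
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Keep the first t singular directions: each row of U_t (here scaled by the
# singular values) becomes a t-dimensional word vector.
word_vectors = U[:, :t] * S[:t]
print(word_vectors.shape)  # (8, 2)
```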
How do you learn low-dimensional word vectors?
Target:
• Build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts
• Dense: in contrast to co-occurrence-matrix word vectors, which are sparse (high dimensional)
• Word vectors are sometimes called word embeddings or word representations. They are a distributed representation.
What is word2vec and what is the idea of word2vec?
- Input: a large corpus of text
- Output: every word in a fixed vocabulary is represented by a vector
- Major idea:
- For words appearing in the same context, their vectors should be close
- For words not appearing in the same context, their vectors should be far away
Check slide 11 pg 18 - 22
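A sketch of training word2vec in practice with the gensim library (parameter names assume the gensim 4.x API; the two-sentence corpus is purely illustrative, so the resulting vectors are meaningless):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences.
sentences = [
    ["i", "enjoy", "natural", "language", "processing"],
    ["i", "like", "deep", "learning"],
]

# sg=1 selects skip-gram (sg=0 would be CBOW); negative=5 enables negative sampling.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)

vec = model.wv["enjoy"]                # the learned vector for "enjoy"
print(model.wv.most_similar("enjoy"))  # nearest words by cosine similarity
```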
What do you do in negative sampling?
Down-sample the non-contextual words, so as to make sure the numbers of contextual and non-contextual words in O are the same.
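A rough sketch of that idea: for each observed context window, draw as many non-contextual (negative) words as there are contextual ones. The uniform sampling here is a simplification; word2vec actually samples negatives from a smoothed unigram distribution.

```python
import random

vocab = ["i", "enjoy", "natural", "language", "processing", "like", "deep", "learning"]

def negative_samples(exclude, k):
    """Draw k words that are neither the centre word nor in the context window."""
    candidates = [w for w in vocab if w not in exclude]
    return random.sample(candidates, k)

centre = "enjoy"
context = ["i", "natural", "language"]  # positive (contextual) words
negatives = negative_samples(set(context) | {centre}, k=len(context))
print(negatives)  # as many negatives as positives
```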
Give details about the dot product in word2vec.
- Because all embedding values are randomly drawn from the same distribution, the embedding norms are roughly the same
- Hence, to speed up computation, we ignore the denominator of the cosine similarity (the product of the norms) and measure similarity by the dot product of the embeddings
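A short sketch contrasting the full cosine similarity with the dot-product shortcut (the random vectors are placeholders): when the norms are roughly equal, the denominator is nearly constant, so dropping it barely changes the ranking but saves computation.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=50), rng.normal(size=50)

dot = np.dot(u, v)
cosine = dot / (np.linalg.norm(u) * np.linalg.norm(v))  # full formula with denominator

# With roughly equal norms across the vocabulary, ranking by dot product is
# close to ranking by cosine similarity, at lower cost.
print(dot, cosine)
```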
What are word2vec’s two approaches?
CBOW (continuous bag-of-word): Given a context, predict the missing word
• same procedure ____ every year
• as long ___ you sing
• please stay ___ you are
Skip-gram: given a word, predict the context words
• ____ ______ as _____ ______
• If window size is two, we aim to predict:
• (w, c-2), (w, c-1), (w, c+1) and (w, c+2)
Note that order information is not preserved, i.e. we do not distinguish whether a context word occurs before or after the centre word.
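A sketch of generating skip-gram training pairs for a toy sentence with window size two; note that the pairs carry no information about whether the context word came before or after the centre word.

```python
def skipgram_pairs(sentence, window=2):
    """Yield (centre, context) pairs within a symmetric window."""
    for i, w in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                yield (w, sentence[j])

sentence = ["same", "procedure", "as", "every", "year"]
print(list(skipgram_pairs(sentence)))
# For centre "as": ("as", "same"), ("as", "procedure"), ("as", "every"), ("as", "year")
```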
Name embedding models (at least two).
Word2vec
GloVe
There are many other word embedding models; these two are (arguably) the most popular.
Bengio’s NNLM vs word2vec
- Word2vec is “simpler”
- It has no “hidden” layer
- Simpler models can be trained and run faster
- It can train on billions of word tokens in hours/days
• Uni-directional vs bi-directional
• NNLM (neural network based language model) predicts the next word using the preceding words (left-to-right prediction)
• Word2vec predicts both preceding and succeeding words (or, in CBOW, uses the context words to predict the centre word), hence it makes bi-directional predictions
Why does word2vec work?
• Not all neural embeddings are good
• Mikolov et al. (2013) survey four models (Recurrent, MLP NNLM, CBOW, Skip-Gram)
• Some of them work quite poorly (under their parametrizations)
• Skip-Gram and CBOW are the “simplest” possible models
• Can run on much more data
• And are much faster than predecessor neural models
• “Tricks” play an important role (negative sampling)
• It took a long stream of experimentation to make neural net language models successful (~10 years)
Why do “similar” words have similar vectors under skip-gram or CBOW?
- Assume two words have similar contexts (e.g. love/like)
- Word2vec training pushes words that appear in the same context window to have similar vectors (i.e. a high dot product)
- Since both words are pushed towards the vectors of the same context words, they end up close to each other