Word Embeddings Flashcards

1
Q

What are the problems with resources like WordNet?

A
  • Great as a resource, but missing nuance
    • e.g. “proficient” is listed as a synonym for “good”, which is only correct in some contexts
  • Missing new meanings of words
    • e.g. wicked, badass, nifty, wizard, genius, ninja, bombest
    • Impossible to keep up-to-date!
  • Subjective
  • Requires human labor to create and adapt
  • Can’t compute accurate word similarity
2
Q

What are the problems with one-hot vectors?

A

• Example: in web search, if a user searches for “Seattle motel”, we would like to also match documents containing “Seattle hotel”.
• But with one-hot vectors:
  Motel = [0,0,0,1,0,…,0,0]
  Hotel = [0,1,0,0,0,…,0,0]
• These two vectors are orthogonal → every pair of words is orthogonal, and hence all pairs of words have the same “distance” to each other
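
To see the problem concretely, here is a minimal sketch (the toy vocabulary is my own illustration, not from the card): every pair of distinct one-hot vectors has dot product 0, so “motel” is no closer to “hotel” than to any unrelated word.

```python
import numpy as np

# Toy vocabulary (illustrative): each word gets a one-hot vector.
vocab = ["seattle", "hotel", "motel", "banana"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Dot products between distinct one-hot vectors are always 0 (orthogonal),
# so every pair of different words looks equally dissimilar.
print(one_hot["motel"] @ one_hot["hotel"])   # 0.0
print(one_hot["motel"] @ one_hot["banana"])  # 0.0
```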

3
Q

How do you represent the meaning of a word?

A

• Distributional semantics (aka the distributional hypothesis): a word’s meaning is given by the words that frequently appear close by.

“You shall know a word by the company it keeps” (J. R. Firth)
“Words are similar if they occur in similar contexts” (Harris)

  • One of the most successful ideas of modern statistical NLP!
  • When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window)
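
A minimal sketch of extracting such a fixed-size window context (the window size and example sentence are my own choices):

```python
def context(tokens, index, window=2):
    """Return the words within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

tokens = "you shall know a word by the company it keeps".split()
print(context(tokens, tokens.index("word")))  # ['know', 'a', 'by', 'the']
```
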
4
Q

What could a word vector solution based on distributional semantics be?

A

Solution 1: count co-occurrence of words (co-occurrence matrix)
• Capture in which contexts a word appears
• Context is modelled using a window over the words

5
Q

Give the details of the co-occurrence matrix.

A
  • Assumption: if we collect co-occurrence counts over thousands of sentences, the vectors for “enjoy” and “like” will end up similar.
  • Like the document-term matrix used in Information Retrieval
    • Doc-term matrix: represent a document by the words that appear in it
    • Word co-occurrence matrix: represent a word by its context (i.e. its surrounding words)
  • Problem: vectors become very large with real data
    • Workaround: apply dimensionality reduction (truncated SVD)
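
A minimal sketch of building such a word co-occurrence matrix over a toy corpus (the corpus and window size are my own illustration); the count vectors for “enjoy” and “like” come out similar because their contexts overlap:

```python
from collections import defaultdict

corpus = [
    "i enjoy natural language processing",
    "i like natural language processing",
    "i enjoy deep learning",
]
window = 2

# counts[w][c] = number of times c appears within `window` words of w
counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[w][tokens[j]] += 1

print(dict(counts["enjoy"]))  # overlaps heavily with...
print(dict(counts["like"]))   # ...the contexts of "like"
```
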
6
Q

Problems with SVD + co-occurrence matrix?

A

• Instead of using the high-dimensional original co-occurrence matrix M, use U(t), the first t columns of the SVD factor U (the dimension t is given by the user)

Cons:
• High computational cost for large datasets
• Cannot be dynamically updated with new words
• Didn’t work too well in practice
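
A minimal sketch of the truncated-SVD step (the matrix here is random, purely to show the shapes; in practice M would hold the co-occurrence counts):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.poisson(1.0, size=(500, 500)).astype(float)  # stand-in for a |V| x |V| count matrix
t = 100                                              # target dimension, chosen by the user

# Compute the SVD, then keep only the first t columns of U as the word vectors.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
word_vectors = U[:, :t]
print(word_vectors.shape)  # (500, 100): one t-dimensional vector per word
```
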
7
Q

How do you learn low-dimensional word vectors?

A

Target:
• Build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts

• Dense: in contrast to co-occurrence-matrix word vectors, which are sparse (high-dimensional)

• Word vectors are sometimes called word embeddings or word representations. They are a distributed representation.

8
Q

What is word2vec, and what is its main idea?

A
  • Input: a large corpus of text
  • Output: every word in a fixed vocabulary is represented by a vector
  • Major idea:
    • For words appearing in the same context, their vectors should be close
    • For words not appearing in the same context, their vectors should be far away

Check slide 11 pg 18 - 22
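
As a usage sketch (the gensim library and its parameters are my assumption here, not something the card prescribes), training word2vec and reading off the learned vectors looks roughly like this:

```python
from gensim.models import Word2Vec

sentences = [
    ["i", "enjoy", "natural", "language", "processing"],
    ["i", "like", "natural", "language", "processing"],
    ["i", "enjoy", "deep", "learning"],
]

# sg=1 selects skip-gram; vector_size is the embedding dimension.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["enjoy"].shape)               # (50,) dense vector for one word
print(model.wv.similarity("enjoy", "like"))  # similarity of two learned vectors
```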

9
Q

What do you do in negative sampling?

A

Down-sample the non-contextual words, so as to make sure the numbers of contextual and non-contextual words in O are the same.
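
A minimal sketch of that balancing step (the toy vocabulary is my own, and the uniform sampling is a simplification; real word2vec draws negatives from a smoothed unigram distribution): for each observed (centre, context) pair, draw one word that did not appear in the window as a negative pair, so positives and negatives are equal in number.

```python
import random

vocab = ["i", "enjoy", "like", "natural", "language", "processing", "deep", "learning"]

def sample_negative(centre, context_words, vocab):
    # Uniform sampling for simplicity; word2vec samples from a unigram^0.75 distribution.
    candidates = [w for w in vocab if w != centre and w not in context_words]
    return random.choice(candidates)

centre, context_words = "enjoy", ["i", "natural", "language"]
positives = [(centre, c) for c in context_words]
negatives = [(centre, sample_negative(centre, context_words, vocab)) for _ in positives]
print(positives)  # contextual pairs
print(negatives)  # equally many non-contextual pairs
```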

10
Q

Give details about the dot product in word2vec.

A
  • Because all embedding values are randomly drawn from the same distribution, the embedding norms are roughly the same
  • Hence, to speed up computation, we skip computing the denominator (the product of the two norms used in cosine similarity) and measure similarity by the dot product of the embeddings alone
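
A minimal sketch of why this is reasonable (the dimension and distribution are illustrative): when two embeddings are drawn from the same distribution their norms are close, so cosine similarity (the dot product divided by the norms) differs from the plain dot product only by a roughly constant factor.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(0, 0.1, 100)
v = rng.normal(0, 0.1, 100)

dot = u @ v
cosine = dot / (np.linalg.norm(u) * np.linalg.norm(v))
print(dot, cosine)  # same sign; cosine just rescales the dot product
```
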
11
Q

What are word2vec’s two approaches?

A

CBOW (continuous bag-of-words): given the context, predict the missing word
• same procedure ____ every year
• as long ___ you sing
• please stay ___ you are

Skip-gram: given a word, predict the context words
• ____ ______ as _____ ______
• If the window size is two, we aim to predict:
  • (w, c-2), (w, c-1), (w, c+1) and (w, c+2)

Note that order information is not preserved, i.e. we do not distinguish whether a word is more likely to occur before or after the centre word.
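
A minimal sketch of generating skip-gram training pairs with window size two, as in the example above (the sentence is illustrative):

```python
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((w, tokens[j]))  # order information is discarded
    return pairs

print(skipgram_pairs("as long as you sing".split()))
```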

12
Q

Name embedding models (at least two).

A

Word2vec
GloVe
There are many other word embedding models; these two are (arguably) the most popular.

13
Q

Bengio’s NNLM vs word2vec

A
  • Word2vec is “simpler”
    • It has no “hidden” layer
    • Simpler models can be trained and run faster
    • It can train on billions of word tokens in hours/days
  • Uni-directional vs bi-directional
    • NNLM (neural network based language model) predicts the next word using the preceding words (left-to-right prediction)
    • Word2vec predicts both preceding and succeeding words (or, in CBOW, uses the context words to predict the centre word), hence it makes bi-directional predictions

14
Q

Why does word2vec work?

A

• Not all neural embeddings are good
  • Mikolov et al. (2013) survey four models (Recurrent, MLP NNLM, CBOW, Skip-gram)
  • Some of them work quite poorly (under their parametrizations)
• Skip-gram and CBOW are the “simplest” possible models
  • They can run on much more data
  • And they are much faster than predecessor neural models
• “Tricks” play an important role (e.g. negative sampling)
• It took a long stream of experimentation (~10 years) to make neural net language models successful

15
Q

Why do “similar” words have similar vectors under skip-gram or CBOW?

A
  • Assume two words have similar contexts (e.g. love/like)
  • Word2vec training pushes words that occur in the same context window to have similar vectors (i.e. a high dot product)

16
Q

Which notion of “similarity” does word2vec capture?

A

Primarily semantic similarity, but also syntactic/morphological similarity.