Lecture 10 - Word Vectors and Vector Semantics Flashcards

1
Q

What is denotational semantics?

A

Denotational semantics is an approach to formalizing the meanings of programming languages by constructing mathematical objects (called denotations) that describe the meanings of expressions in the language. In NLP, the analogous idea is that a word's meaning is the thing or concept it denotes.

2
Q

What is the most common solution to get usable meaning from a text that can be processed by a computer?

A

Using thesauri like WordNet, which contain lists of synonym sets (synsets) and hyponym (“is a”) relationships

3
Q

What are some downsides of thesauri like WordNet?

A
  • missing nuance - e.g., “proficient” is listed as a synonym of “good”, but it’s only correct in some contexts
  • missing new meanings of words - e.g., badass, wicked (slang); impossible to keep up-to-date
  • requires human labor to curate and create
  • subjective
  • can’t compute accurate word similarity
4
Q

What is TF-IDF?

A
  • a common baseline model
  • represented by sparse vectors
  • words are represented by a simple function of the counts of nearby words
5
Q

What is Word2Vec?

A
  • represented by dense vectors
  • representation is created by training a classifier to distinguish nearby and far-away words

6
Q

Define words by their usage from an NLP perspective.

P.S.: I don’t know how to formulate this question so just swipe if you don’t understand

A

Words are defined by their environments (the words around them)

If A and B have almost identical environments we say that they are synonyms

7
Q

What is a term-document matrix?

A

Term-document matrix = Each row of the matrix is a document vector, with one column for every term in the entire corpus. The value in each cell of the matrix is the term frequency.
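
A minimal sketch (not from the lecture) of building such a matrix with scikit-learn's CountVectorizer, assuming a recent version for get_feature_names_out; the toy documents are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "sugar and apricot pie",
    "apple pie with sugar",
    "information retrieval and search",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)         # rows = documents, columns = terms
print(vectorizer.get_feature_names_out())  # the vocabulary (column headers)
print(X.toarray())                         # term frequency in each cell
```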

8
Q

What is a term-context matrix? (word-word matrix)

A

Term-context matrix = each row corresponds to a word of interest and each column to a context word. Each cell holds the number of times the context word occurs near the word of interest (within some window).

So row headers are words of interest, and column headers are words that appear in the contexts of those words of interest. Words that never appear close to a word of interest get a value of 0, while words that appear near it often get a high value (e.g., “pie” is often seen around “apple”, so the value might be 100).

Using this matrix you can figure out which words are similar to each other: their row vectors look roughly the same. E.g., if “apricot” has been spotted around “sugar” 10 times and around “pie” 3 times, and “apple” has been spotted around “sugar” 9 times and around “pie” 3 times, then we can say the words are similar.
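
A minimal sketch of building such a term-context matrix in plain Python, assuming a ±2-word window; the toy corpus is made up for illustration:

```python
# Count, for each word, how often every other word appears within +/-2 positions.
from collections import Counter, defaultdict

corpus = "the apricot pie with sugar and the apple pie with sugar".split()
window = 2

cooc = defaultdict(Counter)  # cooc[word][context_word] = co-occurrence count
for i, word in enumerate(corpus):
    context = corpus[max(0, i - window):i] + corpus[i + 1:i + 1 + window]
    for c in context:
        cooc[word][c] += 1

print(cooc["apricot"])  # the row vector for "apricot"
print(cooc["apple"])    # a similar-looking row suggests a similar word
```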

9
Q

How do you calculate the cosine similarity between two vectors? What is cosine similarity?

A

Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.

The cosine similarity formula is the dot product of the two vectors divided by the product of their lengths:

cos(v, w) = (v · w) / (|v| |w|) = Σ v_i * w_i / ( sqrt(Σ v_i²) * sqrt(Σ w_i²) )

In other words: sum the products of the two words' counts over every context word, then divide by the product of the two count vectors' Euclidean norms.
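
A minimal sketch in NumPy, using the toy apricot/apple counts from the previous card (the third context-word count is made up):

```python
import numpy as np

def cosine(v, w):
    # dot product divided by the product of the vector lengths
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# counts of each word near the context words: sugar, pie, data
apricot = np.array([10, 3, 0])
apple   = np.array([9, 3, 1])
digital = np.array([0, 1, 12])

print(cosine(apricot, apple))    # close to 1 -> very similar
print(cosine(apricot, digital))  # close to 0 -> not similar
```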

10
Q

Why is raw frequency not such a good measure of calculating representation or similarity?

A

Frequency is useful: e.g., if “sugar” appears a lot near “apricot”, that is useful information.
But overly frequent words like “the” or “it” are not very informative about the context.
So we need a weighting function that resolves this paradox.

One solution is TF-IDF

11
Q

Distinguish between term frequency (TF) and inverse document frequency (IDF). How can you use them to compute similarity between words?

A

TF measures how frequent a term is within a document:
- TF(t, d) = 1 + log10(count(t, d)) if count(t, d) > 0, and 0 otherwise

IDF(t) = log10(N / df_t), where N is the number of documents in the collection and df_t is the number of documents that contain term t.

The TF-IDF value for term t in document d is then TF(t, d) × IDF(t).

You can compare two words using TF-IDF-weighted cosine similarity to see if they are similar.
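
A minimal sketch of these formulas in Python; all counts and document numbers are made up for illustration:

```python
import math

def tf(count):
    # 1 + log10(count) if the term occurs, else 0
    return 1 + math.log10(count) if count > 0 else 0.0

def idf(n_docs, doc_freq):
    # log10(N / df_t)
    return math.log10(n_docs / doc_freq)

def tf_idf(count, n_docs, doc_freq):
    return tf(count) * idf(n_docs, doc_freq)

# "sugar": 10 occurrences in this document, appears in 50 of 1000 documents
print(tf_idf(10, 1000, 50))     # high weight: frequent here, rare overall
# "the": 100 occurrences here, but appears in all 1000 documents
print(tf_idf(100, 1000, 1000))  # 0.0, because idf = log10(1) = 0
```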

12
Q

True or False?

The closer the cosine value to 1, the smaller the angle and the greater the match between vectors

A

True.

Why? Because a small angle means close to 0 degrees, and cos(0°) = 1.

13
Q

What is one major problem of TF-IDF?

A

TF-IDF vectors are long and sparse.

Instead, learn how to encode similarity in the vectors themselves => short, dense vectors, where most of the values are non-zero

14
Q

Why should we use dense vectors and not sparse?

A
  • short vectors may be easier to use as features in machine learning => fewer weights to tune
  • dense vectors may generalize better than storing explicit counts
  • they may do better at capturing synonyms
  • they work better in practice
15
Q

What is distributional semantics?

A

A word’s meaning is given by the words that frequently appear close-by

When a word w appears in a text, its context is a set of words that appear nearby (within a fixed-size window)

Use the many contexts of w to build a representation of w

16
Q

How do you build a dense vector for a word?

A

It is chosen so that it is similar to the vectors of words that appear in similar contexts

17
Q

Instead of counting how many times the word “sugar” appears close to “apricot”, what does Word2Vec do?

A

It trains a classifier on a binary prediction task:

=> Is “sugar” likely to show up near “apricot”?
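
A tiny sketch of that binary prediction: in skip-gram with negative sampling, the probability that a context word shows up near a target word is modeled as a sigmoid of the dot product of their vectors (the vectors below are toy values):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_apricot = np.array([0.2, -0.1, 0.4])  # target-word vector (toy values)
c_sugar   = np.array([0.3,  0.0, 0.5])  # context-word vector (toy values)

print(sigmoid(v_apricot @ c_sugar))     # P(+ | apricot, sugar)
```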

18
Q

Is Word2Vec a supervised or unsupervised model?

A

Unsupervised (self-supervised), because we do not need labeled data to answer the question “is w likely to show up near y”; the running text of the corpus itself provides the supervision.

We then take the learned classifier weights as the word vectors.

19
Q

What is the difference between CBOW and Skip-gram?

A

Via Word2Vec

The CBOW architecture predicts the current word based on the context, while the Skip-gram predicts surrounding words (the context) given the current word
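
A minimal usage sketch with the gensim library (assuming gensim 4.x parameter names; the toy sentences are made up). The sg flag switches between the two architectures:

```python
from gensim.models import Word2Vec

sentences = [["the", "apricot", "pie", "with", "sugar"],
             ["the", "apple", "pie", "with", "sugar"]]

cbow      = Word2Vec(sentences, sg=0, vector_size=50, window=2, min_count=1)  # CBOW
skip_gram = Word2Vec(sentences, sg=1, vector_size=50, window=2, min_count=1)  # Skip-gram

print(skip_gram.wv["apricot"][:5])  # the learned dense vector for "apricot"
```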

20
Q

Explain the skip-gram algorithm

A
  1. treat the target word and a neighboring context word as positive examples (within some window, e.g., ±2 words)
  2. randomly sample other words in the lexicon to get negative examples
  3. use logistic regression to train a classifier to distinguish those two cases
  4. use the learned classifier weights as the word vectors (a rough sketch follows below)
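
A rough end-to-end sketch of these steps in NumPy (skip-gram with negative sampling); the toy corpus, window size, and hyperparameters are made up, and a real implementation would sample negatives by unigram frequency over a much larger corpus:

```python
import numpy as np

corpus = "the apricot pie with sugar and the apple pie with sugar".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, dim, window, k, lr = len(vocab), 10, 2, 2, 0.05

rng = np.random.default_rng(0)
W_target  = rng.normal(scale=0.1, size=(V, dim))  # target-word vectors
W_context = rng.normal(scale=0.1, size=(V, dim))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(50):
    for i, word in enumerate(corpus):
        t = idx[word]
        # 1. positive examples: context words within the +/-2 window
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i:
                continue
            pos = idx[corpus[j]]
            # 2. negative examples: k randomly sampled vocabulary words
            negatives = rng.integers(0, V, size=k)
            # 3. logistic-regression-style update on sigmoid(dot product)
            for ctx, label in [(pos, 1.0)] + [(n, 0.0) for n in negatives]:
                v_t, v_c = W_target[t].copy(), W_context[ctx].copy()
                grad = sigmoid(v_t @ v_c) - label
                W_target[t]    -= lr * grad * v_c
                W_context[ctx] -= lr * grad * v_t

# 4. the rows of W_target are the learned word vectors
print(W_target[idx["apricot"]][:5])
```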
21
Q

How do you evaluate embeddings?

A
  • compare to human scores on word-similarity tasks, using available benchmark corpora (a sketch follows below)
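
A minimal sketch of that kind of intrinsic evaluation: compute the Spearman correlation between the model's similarities and human similarity ratings. The word pairs and scores below are made up; real benchmarks include datasets such as WordSim-353 or SimLex-999:

```python
from scipy.stats import spearmanr

# hypothetical human similarity ratings for a few word pairs (scale 0-10)
human_scores = {
    ("apricot", "apple"): 8.5,
    ("apricot", "car"):   1.0,
    ("pie", "cake"):      7.0,
}

# placeholder model similarities; in practice, cosine between learned vectors
model_scores = {
    ("apricot", "apple"): 0.9,
    ("apricot", "car"):   0.1,
    ("pie", "cake"):      0.8,
}

pairs = list(human_scores)
human = [human_scores[p] for p in pairs]
model = [model_scores[p] for p in pairs]
corr, _ = spearmanr(human, model)
print(corr)  # closer to 1 -> better embeddings
```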