Lecture 10 - Word Vectors and Vector Semantics Flashcards

1
Q

What is denotational semantics?

A

Denotational semantics is an approach to formalizing the meanings of programming languages by constructing mathematical objects (called denotations) that describe the meanings of expressions in the language. In NLP, the analogous idea is that a word's meaning is the thing or concept it denotes.

2
Q

What is the most common solution to get usable meaning from a text that can be processed by a computer?

A

Using thesauri like WordNet, which contain lists of synonym sets (synsets) and hyponym (“is a”) relationships

3
Q

What are some downsides of thesauri like WordNet?

A
  • missing nuance - e.g., “proficient” is listed as a synonym of “good”, but it’s only correct in some contexts
  • missing new meanings of words - e.g., badass, wicked (slang); impossible to keep up-to-date
  • requires human labor to curate and create
  • subjective
  • can’t compute accurate word similarity
4
Q

What is TF-IDF?

A
  • a common baseline model
  • represented by sparse vectors
  • words are represented by a simple function of the counts of nearby words
5
Q

What is Word2Vec?

A
  • represented by dense vectors
  • representation is created by training a classifier to distinguish nearby and far-away words

6
Q

Define words by their usage from an NLP perspective.

P.S.: I don’t know how to formulate this question so just swipe if you don’t understand

A

Words are defined by their environments (the words around them)

If A and B have almost identical environments we say that they are synonyms

7
Q

What is a term-document matrix?

A

Term-document matrix = Each row of the matrix is a document vector, with one column for every term in the entire corpus. The value in each cell of the matrix is the term frequency.
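
A minimal sketch (not from the lecture) of building such a matrix with scikit-learn's CountVectorizer, assuming a recent version for get_feature_names_out; the toy documents are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "sugar and apricot pie",
    "apple pie with sugar",
    "information retrieval and search",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)         # rows = documents, columns = terms
print(vectorizer.get_feature_names_out())  # the vocabulary (column headers)
print(X.toarray())                         # term frequency in each cell
```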

8
Q

What is a term-context matrix? (word-word matrix)

A

Term-context matrix = each row corresponds to a word of interest and each column to a context word. Each cell holds the number of times the context word occurs near the word of interest (within some window).

So row headers are words of interest, and column headers are words that appear in the contexts of those words of interest. Words that never appear close to a word of interest get a value of 0, while words that appear near it often get a high value (e.g., “pie” is often seen around “apple”, so the value might be 100).

Using this matrix you can figure out which words are similar to each other: their row vectors look roughly the same. E.g., if “apricot” has been spotted around “sugar” 10 times and around “pie” 3 times, and “apple” has been spotted around “sugar” 9 times and around “pie” 3 times, then we can say the words are similar.
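
A minimal sketch of building such a term-context matrix in plain Python, assuming a ±2-word window; the toy corpus is made up for illustration:

```python
# Count, for each word, how often every other word appears within +/-2 positions.
from collections import Counter, defaultdict

corpus = "the apricot pie with sugar and the apple pie with sugar".split()
window = 2

cooc = defaultdict(Counter)  # cooc[word][context_word] = co-occurrence count
for i, word in enumerate(corpus):
    context = corpus[max(0, i - window):i] + corpus[i + 1:i + 1 + window]
    for c in context:
        cooc[word][c] += 1

print(cooc["apricot"])  # the row vector for "apricot"
print(cooc["apple"])    # a similar-looking row suggests a similar word
```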

9
Q

How do you calculate the cosine similarity between two vectors? What is cosine similarity?

A

Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.

The cosine similarity formula is the dot product of the two vectors divided by the product of their lengths:

cos(v, w) = (v · w) / (|v| |w|) = Σ v_i * w_i / ( sqrt(Σ v_i²) * sqrt(Σ w_i²) )

In other words: sum the products of the two words' counts over every context word, then divide by the product of the two count vectors' Euclidean norms.
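
A minimal sketch in NumPy, using the toy apricot/apple counts from the previous card (the third context-word count is made up):

```python
import numpy as np

def cosine(v, w):
    # dot product divided by the product of the vector lengths
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# counts of each word near the context words: sugar, pie, data
apricot = np.array([10, 3, 0])
apple   = np.array([9, 3, 1])
digital = np.array([0, 1, 12])

print(cosine(apricot, apple))    # close to 1 -> very similar
print(cosine(apricot, digital))  # close to 0 -> not similar
```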

10
Q

Why is raw frequency not such a good measure of calculating representation or similarity?

A

Frequency is useful: e.g., if “sugar” appears a lot near “apricot”, that is useful information.
But overly frequent words like “the” or “it” are not very informative about the context.
So we need a weighting function that resolves this paradox.

One solution is TF-IDF

11
Q

Distinguish between term frequency (TF) and inverse document frequency (IDF). How can you use them to compute similarity between words?

A

TF measures how frequent a term is within a document:
- TF(t, d) = 1 + log10(count(t, d)) if count(t, d) > 0, and 0 otherwise

IDF(t) = log10(N / df_t), where N is the number of documents in the collection and df_t is the number of documents that contain term t.

The TF-IDF value for term t in document d is then TF(t, d) × IDF(t).

You can compare two words using TF-IDF-weighted cosine similarity to see if they are similar.
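
A minimal sketch of these formulas in Python; all counts and document numbers are made up for illustration:

```python
import math

def tf(count):
    # 1 + log10(count) if the term occurs, else 0
    return 1 + math.log10(count) if count > 0 else 0.0

def idf(n_docs, doc_freq):
    # log10(N / df_t)
    return math.log10(n_docs / doc_freq)

def tf_idf(count, n_docs, doc_freq):
    return tf(count) * idf(n_docs, doc_freq)

# "sugar": 10 occurrences in this document, appears in 50 of 1000 documents
print(tf_idf(10, 1000, 50))     # high weight: frequent here, rare overall
# "the": 100 occurrences here, but appears in all 1000 documents
print(tf_idf(100, 1000, 1000))  # 0.0, because idf = log10(1) = 0
```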

12
Q

True or False?

The closer the cosine value to 1, the smaller the angle and the greater the match between vectors

A

True.

Why? Because a small angle means close to 0 degrees, and cos(0°) = 1.

13
Q

What is one major problem of TF-IDF?

A

TF-IDF vectors are long and sparse.

Instead, learn how to encode similarity in the vectors themselves => short, dense vectors, where most of the values are non-zero

14
Q

Why should we use dense vectors and not sparse?

A
  • short vectors may be easier to use as features in machine learning => fewer weights to tune
  • dense vectors may generalize better than storing explicit counts
  • they may do better at capturing synonyms
  • they work better in practice
15
Q

What is distributional semantics?

A

A word’s meaning is given by the words that frequently appear close-by

When a word w appears in a text, its context is a set of words that appear nearby (within a fixed-size window)

Use the many contexts of w to build a representation of w

16
Q

How do you build a dense vector for a word?

A

It is chosen so that it is similar to the vectors of words that appear in similar contexts

17
Q

Instead of counting how many times the word “sugar” appears close to “apricot”, what does Word2Vec do?

A

It trains a classifier on a binary prediction task:

=> Is “sugar” likely to show up near “apricot”?
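
A tiny sketch of that binary prediction: in skip-gram with negative sampling, the probability that a context word shows up near a target word is modeled as a sigmoid of the dot product of their vectors (the vectors below are toy values):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_apricot = np.array([0.2, -0.1, 0.4])  # target-word vector (toy values)
c_sugar   = np.array([0.3,  0.0, 0.5])  # context-word vector (toy values)

print(sigmoid(v_apricot @ c_sugar))     # P(+ | apricot, sugar)
```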

18
Q

Is Word2Vec a supervised or unsupervised model?

A

Unsupervised (self-supervised), because we do not need labeled data to answer the question “is w likely to show up near y”; the running text of the corpus itself provides the supervision.

We then take the learned classifier weights as the word vectors.

19
Q

What is the difference between CBOW and Skip-gram?

A

Via Word2Vec

The CBOW architecture predicts the current word based on the context, while the Skip-gram predicts surrounding words (the context) given the current word
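
A minimal usage sketch with the gensim library (assuming gensim 4.x parameter names; the toy sentences are made up). The sg flag switches between the two architectures:

```python
from gensim.models import Word2Vec

sentences = [["the", "apricot", "pie", "with", "sugar"],
             ["the", "apple", "pie", "with", "sugar"]]

cbow      = Word2Vec(sentences, sg=0, vector_size=50, window=2, min_count=1)  # CBOW
skip_gram = Word2Vec(sentences, sg=1, vector_size=50, window=2, min_count=1)  # Skip-gram

print(skip_gram.wv["apricot"][:5])  # the learned dense vector for "apricot"
```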

20
Q

Explain the skip-gram algorithm

A
  1. treat the target word and a neighboring context word as positive examples (within some window, e.g., ±2 words)
  2. randomly sample other words in the lexicon to get negative examples
  3. use logistic regression to train a classifier to distinguish those two cases
  4. use the learned classifier weights as the word vectors (a rough sketch follows below)
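
A rough end-to-end sketch of these steps in NumPy (skip-gram with negative sampling); the toy corpus, window size, and hyperparameters are made up, and a real implementation would sample negatives by unigram frequency over a much larger corpus:

```python
import numpy as np

corpus = "the apricot pie with sugar and the apple pie with sugar".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, dim, window, k, lr = len(vocab), 10, 2, 2, 0.05

rng = np.random.default_rng(0)
W_target  = rng.normal(scale=0.1, size=(V, dim))  # target-word vectors
W_context = rng.normal(scale=0.1, size=(V, dim))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(50):
    for i, word in enumerate(corpus):
        t = idx[word]
        # 1. positive examples: context words within the +/-2 window
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i:
                continue
            pos = idx[corpus[j]]
            # 2. negative examples: k randomly sampled vocabulary words
            negatives = rng.integers(0, V, size=k)
            # 3. logistic-regression-style update on sigmoid(dot product)
            for ctx, label in [(pos, 1.0)] + [(n, 0.0) for n in negatives]:
                v_t, v_c = W_target[t].copy(), W_context[ctx].copy()
                grad = sigmoid(v_t @ v_c) - label
                W_target[t]    -= lr * grad * v_c
                W_context[ctx] -= lr * grad * v_t

# 4. the rows of W_target are the learned word vectors
print(W_target[idx["apricot"]][:5])
```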
21
Q

How do you evaluate embeddings?

A
  • compare to human scores on word-similarity tasks, using available benchmark corpora (a sketch follows below)
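
A minimal sketch of that kind of intrinsic evaluation: compute the Spearman correlation between the model's similarities and human similarity ratings. The word pairs and scores below are made up; real benchmarks include datasets such as WordSim-353 or SimLex-999:

```python
from scipy.stats import spearmanr

# hypothetical human similarity ratings for a few word pairs (scale 0-10)
human_scores = {
    ("apricot", "apple"): 8.5,
    ("apricot", "car"):   1.0,
    ("pie", "cake"):      7.0,
}

# placeholder model similarities; in practice, cosine between learned vectors
model_scores = {
    ("apricot", "apple"): 0.9,
    ("apricot", "car"):   0.1,
    ("pie", "cake"):      0.8,
}

pairs = list(human_scores)
human = [human_scores[p] for p in pairs]
model = [model_scores[p] for p in pairs]
corr, _ = spearmanr(human, model)
print(corr)  # closer to 1 -> better embeddings
```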