Lecture 10 - Word Vectors and Vector Semantics Flashcards
What is denotational semantics?
Denotational semantics is an approach to formalizing the meanings of programming languages by constructing mathematical objects (called denotations) that describe the meanings of expressions in the language.
What is the most common solution to get usable meaning from a text that can be processed by a computer?
Using thesauri like WordNet, which contain lists of synonym sets (synsets) and hyponym ("is a") relationships
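A minimal sketch (not from the lecture) of querying WordNet through NLTK's interface; it assumes nltk is installed and the wordnet corpus has been downloaded.

```python
# Sketch: looking up synsets and hyponyms in WordNet via NLTK
# (assumes nltk.download('wordnet') has already been run).
from nltk.corpus import wordnet as wn

for synset in wn.synsets("good"):
    # Each synset is a set of lemmas that share one sense.
    print(synset.name(), synset.lemma_names())

# Hyponyms ("is a" relationships): more specific concepts under a synset.
dog = wn.synset("dog.n.01")
print([h.name() for h in dog.hyponyms()][:5])
```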
What are some downsides of thesauri like WordNet?
- missing nuance - e.g., “proficient” is listed as a synonym of “good”, but it’s only correct in some contexts
- missing new meanings of words - e.g., badass, wicked (slang); impossible to keep up to date
- requires human labor to curate and create
- subjective
- can’t compute accurate word similarity
What is TF-IDF?
- a common baseline model
- represented by sparse vectors
- words are represented by a simple function of the counts of nearby words
What is Word2Vec?
- represented by dense vectors
- representation is created by training a classifier to distinguish nearby and far-away words
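A minimal sketch (not from the lecture) of training dense word vectors with gensim's Word2Vec; the toy corpus and hyperparameter values are placeholders chosen only for illustration.

```python
# Sketch: learning dense vectors with gensim's Word2Vec (toy corpus).
from gensim.models import Word2Vec

sentences = [
    ["apricot", "pie", "with", "sugar"],
    ["apple", "pie", "with", "sugar"],
    ["the", "text", "is", "processed", "by", "the", "computer"],
]

# sg=1 selects skip-gram: a classifier that distinguishes words that
# really occur nearby from randomly sampled "far-away" (negative) words.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)

print(model.wv["apple"].shape)         # dense 50-dimensional vector
print(model.wv.most_similar("apple"))  # nearest neighbours by cosine
```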
Define words by their usage from an NLP perspective.
P.S.: I don’t know how to formulate this question so just swipe if you don’t understand
Words are defined by their environments (the words around them)
If A and B have almost identical environments, we say that they are synonyms
What is a term-document matrix?
Term-document matrix = each row of the matrix is a document vector, with one column for every term in the corpus vocabulary. The value in each cell is how many times that term occurs in that document (the term frequency).
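A tiny sketch of such a matrix (rows = documents, columns = terms, as on this card); the documents and counts are made up purely for illustration.

```python
# Toy term-document matrix: counts[i][j] = occurrences of terms[j] in document i.
import numpy as np

terms = ["apple", "pie", "computer"]

counts = np.array([
    [10, 7, 0],   # doc 0: a recipe
    [ 1, 0, 9],   # doc 1: a tech article
])
```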
What is a term-context matrix? (word-word matrix)
Term-context matrix = each row is a word of interest (the target word) and each column is a possible context word. Each cell holds the number of times the context word occurs within a window around the target word.
So basically, row headers are the words of interest and column headers are the words that appear in their contexts. Most words never appear close to a given word of interest, so many cells are 0, while a cell is high when a context word is often seen around the target (e.g., "pie" is often seen around "apple", so that value might be 100).
Using this matrix you can figure out which words are similar to each other: their row vectors look roughly the same. E.g., if "apricot" has been spotted around "sugar" 10 times and around "pie" 3 times, and "apple" has been spotted around "sugar" 9 times and around "pie" 3 times, then we can say the two words are similar (see the sketch below).
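A minimal sketch of a term-context matrix using the toy counts from this card; the corpus and numbers are illustrative, not from the lecture.

```python
# Toy term-context (word-word) matrix: rows are target words, columns are
# context words, and each cell counts how often they co-occur in a window.
import numpy as np

context_words = ["sugar", "pie"]
rows = {
    "apricot": np.array([10, 3]),
    "apple":   np.array([ 9, 3]),
}
# The two row vectors point in almost the same direction, so "apricot"
# and "apple" come out as similar under cosine similarity (next card).
```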
How do you calculate the cosine similarity between two vectors? What is cosine similarity?
Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.
The cosine similarity formula is
cos(v, w) = (v · w) / (|v| * |w|) = sum_i(v_i * w_i) / ( sqrt(sum_i v_i^2) * sqrt(sum_i w_i^2) )
i.e., the sum of the products of the two words' co-occurrence counts with each context word, divided by the product of the two vectors' lengths (the square roots of the sums of their squared counts).
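A minimal sketch of this formula in code, reusing the toy apricot/apple counts from the previous card.

```python
# Cosine similarity: dot product divided by the product of vector lengths.
import numpy as np

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

apricot = np.array([10, 3])   # toy counts near "sugar" and "pie"
apple   = np.array([ 9, 3])
print(cosine(apricot, apple))  # close to 1.0 -> very similar
```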
Why is raw frequency not such a good measure of calculating representation or similarity?
Frequency is useful, e.g., if “sugar” appears a lot near “apricot” that is useful information
But overly frequent words like "the" or "it" are not very informative about the context
So, we need a function that resolves this frequency paradox
One solution is TF-IDF
Distinguish between term frequency (TF) and inverse document frequency (IDF). How can you use them to compute similarity between words?
TF measures how frequent the term is within a single document.
- tf(t, d) = 1 + log10(count(t, d)) if count(t, d) > 0, otherwise 0
IDF measures how rare the term is across the collection: idf(t) = log10(N / df_t), where N is the number of documents in the collection and df_t is the number of documents that contain the term.
The TF-IDF value for term t in document d is then equal to tf(t, d) * idf(t).
You can compare two words using the cosine of their TF-IDF vectors to see how similar they are.
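A short sketch of the TF and IDF definitions above (log base 10); the tiny "documents" are placeholders chosen only for illustration.

```python
# Sketch: TF-IDF weighting with tf = 1 + log10(count) and idf = log10(N / df).
import math

docs = [
    ["apple", "pie", "sugar", "apple"],
    ["apple", "computer", "text"],
    ["sugar", "pie", "pie"],
]

def tf(term, doc):
    count = doc.count(term)
    return 1 + math.log10(count) if count > 0 else 0.0

def idf(term, docs):
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df) if df > 0 else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("apple", docs[0], docs))     # frequent here, but common overall -> modest weight
print(tf_idf("computer", docs[1], docs))  # rare across the collection -> higher weight
```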
True or False?
The closer the cosine value is to 1, the smaller the angle and the greater the match between the vectors
True.
Why? Because as the angle approaches 0 degrees the vectors point in the same direction, and cos(0°) = 1
What is one major problem of TF-IDF?
TF-IDF vectors are long and sparse.
Instead, learn to encode similarity in the vectors themselves => short, dense vectors where most of the values are non-zero
Why should we use dense vectors and not sparse?
- short vectors may be easier to use as features in machine learning => fewer weights to tune
- dense vectors may generalize better than storing explicit counts
- they may do better at capturing synonyms
- they work better in practice
What is distributional semantics?
A word’s meaning is given by the words that frequently appear close-by
When a word w appears in a text, its context is a set of words that appear nearby (within a fixed-size window)
Use the many contexts of w to build a representation of w
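A minimal sketch of collecting the context words of a target word within a fixed-size window; the window size and the toy sentence are arbitrary choices for illustration.

```python
# Sketch: gather the words within a fixed-size window around each
# occurrence of a target word.
def contexts(tokens, target, window=2):
    out = []
    for i, tok in enumerate(tokens):
        if tok == target:
            out.extend(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    return out

tokens = "a slice of apricot pie with sugar and an apple pie".split()
print(contexts(tokens, "pie"))  # ['of', 'apricot', 'with', 'sugar', 'an', 'apple']
```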