Vector Semantics and Embeddings Flashcards
Distributional hypothesis
Words that occur in similar contexts tend to have similar meanings.
The hypothesis was formulated in the 1950s by Joos, Harris and Firth, who noticed that words which are synonyms tended to occur in the same environment with the amount of meaning difference between two words “corresponding to the amount of difference in their environments”.
Vector semantics
Vector semantics instantiates the distributional hypothesis by learning representations of the meaning of words, called embeddings, directly from their distributions in texts.
Representation learning
Self-supervised learning, where useful representations of the input text are automatically learned, instead of crafting representations by hand using feature engineering.
Lexical semantics
The linguistic study of word meaning
Propositional meaning
Two words are synonymous if they are substitutable for one another in any sentence without changing the truth conditions of the sentence - the situations in which the sentence would be true.
Principle of contrast
A difference in linguistic form is always associated with some difference in meaning.
E.g. H₂O and water are synonymous. But H₂O is used in scientific contexts and would be inappropriate in a hiking guide.
Semantic field
A set of words which cover a particular semantic domain and bear structured relations with each other.
E.g. the semantic field of hospitals (surgeon, scalpel, nurse, anesthetic, hospital), restaurants (waiter, menu, plate, food, chef).
Semantic frame
A set of words that denote perspectives or participants in a particular type of event.
E.g. a commercial transaction is a kind of event in which one entity trades money to another entity in return for some good or service, after which the good changes hands or perhaps the service is performed.
This event can be encoded lexically by using verbs like buy (the event from the perspective of the buyer), sell (from the perspective of the seller), pay (focussing on the monetary aspect), or nouns like buyer.
Frames have semantic roles (buyer, seller, goods, money) and words in a sentence can take on these roles.
Connotations
Words have affective meanings.
The aspects of a word’s meaning that are related to a writer or reader’s emotions, sentiment, opinions or evaluations.
Sentiment
Positive or negative evaluation language.
Vector semantics
The standard way to represent word meaning in NLP.
The idea is to represent a word as a point in a multidimensional semantic space that is derived from the distributions of word neighbours.
Embeddings
Vectors for representing words.
Co-occurrence matrix
A way of representing how often words co-occur.
Term-document matrix
Each row represents a word in the vocabulary and each column represents a document from some collection of documents.
Each cell represents the number of times a particular word occurs in a particular document.
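A minimal sketch of building a term-document count matrix in Python; the document names and texts below are toy values invented for illustration:

```python
from collections import Counter

# Toy document collection (invented texts, not real word counts).
docs = {
    "doc_as_you_like_it": "battle good fool wit wit fool",
    "doc_julius_caesar": "battle battle soldier fool good",
}

# Count words per document, then build one row per vocabulary word,
# one column per document.
counts = {name: Counter(text.split()) for name, text in docs.items()}
vocab = sorted({w for text in docs.values() for w in text.split()})
term_document = {w: [counts[name][w] for name in docs] for w in vocab}

for word, row in term_document.items():
    print(word, row)   # e.g. fool [2, 1]
```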
Information retrieval
The task of finding the document d from the D documents in some collection that best matches a query q.
Term-term matrix
A matrix of dimensionality |V| x |V|, where each cell records the number of times the row word and the column word co-occur in some context in some training corpus.
The context could be the document. However, it is common to use smaller contexts, generally a window around the word, e.g. 4 words to the left and 4 words to the right.
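A rough sketch of counting a window-based term-term matrix; the corpus, window size, and function name are illustrative assumptions:

```python
from collections import defaultdict

def term_term_matrix(tokens, window=4):
    """Count how often each word co-occurs with words within ±window positions."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts

# Toy corpus; real matrices are built from a large training corpus.
tokens = "the quick brown fox jumps over the lazy dog".split()
matrix = term_term_matrix(tokens, window=4)
print(matrix["fox"]["the"])   # 2: "the" appears twice within 4 words of "fox"
```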
Cosine similarity
cosine(v, w) = (v · w) / (|v| |w|)
Value ranges from -1 to 1.
But since raw frequency values are non-negative, the cosine similarity for these vectors ranges from 0 to 1.
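A small sketch of the computation, using plain Python on toy count vectors:

```python
import math

def cosine(v, w):
    """Dot product of v and w divided by the product of their vector lengths."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

# Toy count vectors: non-negative values, so the result falls in [0, 1].
print(cosine([1, 2, 3], [2, 4, 6]))   # ~1.0, vectors point in the same direction
print(cosine([1, 0, 0], [0, 1, 0]))   # 0.0, orthogonal vectors
```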
tf-idf weighting
A product of two terms: term frequency and inverse document frequency
w(t, d) = tf(t, d) × idf(t)
tf-idf
term frequency
The frequency of the word t in the document d:
tf(t, d) = count(t, d)
Commonly a log weighting is used:
tf(t, d) = log₁₀(count(t, d) + 1)
tf-idf
inverse document frequency
The document frequency of a term t, df(t), is the number of documents it occurs in.
The inverse document frequency, idf, where N is the total number of documents in the collection:
idf(t) = N / df(t)
Commonly a log weighting is used:
idf(t) = log₁₀(N / df(t))
The fewer documents a term occurs in, the higher this weight.
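Putting the two weights together, a minimal tf-idf sketch over a toy collection (the documents and the helper names tf/idf/tf_idf are illustrative assumptions):

```python
import math

def tf(term, doc_tokens):
    """Log-weighted term frequency: log10(count(t, d) + 1)."""
    return math.log10(doc_tokens.count(term) + 1)

def idf(term, all_docs):
    """Inverse document frequency: log10(N / df(t))."""
    df = sum(1 for doc in all_docs if term in doc)
    return math.log10(len(all_docs) / df) if df else 0.0

def tf_idf(term, doc_tokens, all_docs):
    return tf(term, doc_tokens) * idf(term, all_docs)

# Toy documents, each a list of tokens.
docs = ["good fool wit".split(), "battle soldier fool".split(), "wit wit fool".split()]
print(tf_idf("wit", docs[2], docs))    # occurs in fewer documents, so weighted up
print(tf_idf("fool", docs[2], docs))   # occurs in every document, so idf = 0
```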
Positive Pointwise Mutual Information
Intuition
The best way to weight the association between two words is to ask how much more two words co-occur in our corpus than we would have a priori expected them to appear by chance.
Pointwise Mutual Information
A measure of how often two events x and y occur, compared with what we would expect if they were independent:
I(x, y) = log₂( P(x, y) / (P(x) P(y)) )
The pointwise mutual information between a target word w and a context word c is then defined as:
PMI(w, c) = log₂( P(w, c) / (P(w) P(c)) )
The numerator tells us how often we observed the two words together (assuming we compute probability by using the MLE).
The denominator tells us how often we would expect the two words to co-occur assuming they each occurred independently.
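A small sketch of computing PPMI (i.e. max(PMI, 0)) from raw co-occurrence counts; the tiny count table below is invented for illustration:

```python
import math

def ppmi(cooc, w, c):
    """Positive PMI from a table of raw co-occurrence counts (dict of dicts),
    using maximum-likelihood estimates of P(w, c), P(w) and P(c)."""
    total = sum(sum(row.values()) for row in cooc.values())
    p_wc = cooc[w].get(c, 0) / total
    p_w = sum(cooc[w].values()) / total
    p_c = sum(row.get(c, 0) for row in cooc.values()) / total
    if p_wc == 0:
        return 0.0
    return max(math.log2(p_wc / (p_w * p_c)), 0.0)

# Toy counts: rows are target words, columns are context words.
cooc = {
    "apricot": {"pie": 1, "data": 0},
    "digital": {"pie": 0, "data": 6},
}
print(ppmi(cooc, "digital", "data"))   # > 0: co-occur more often than chance predicts
print(ppmi(cooc, "digital", "pie"))    # 0.0: never observed together
```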
Word2Vec
Intuition of skip-gram
- Treat the target word and a neighbouring context word as positive examples
- Randomly sample other words in the lexicon to get negative samples.
- Use logistic regression to train a classifier to distinguish these two cases.
- Use the learned weights as the embeddings (see the training sketch below).
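A compact training-loop sketch of this idea using NumPy; the corpus, hyperparameters, and uniform negative sampling are simplifying assumptions (word2vec itself samples negatives from a weighted unigram distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "the cat sat on the mat the dog sat on the rug".split()  # toy corpus
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
V, dim, window, k, lr = len(vocab), 8, 2, 2, 0.05

W = rng.normal(scale=0.1, size=(V, dim))  # target-word embeddings
C = rng.normal(scale=0.1, size=(V, dim))  # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(50):
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            t, pos = idx[word], idx[tokens[j]]
            # one positive (target, context) pair plus k random negatives
            pairs = [(pos, 1.0)] + [(int(rng.integers(V)), 0.0) for _ in range(k)]
            for c, label in pairs:
                p = sigmoid(W[t] @ C[c])       # classifier's P(+ | w, c)
                grad = p - label               # gradient of the logistic loss
                g_w, g_c = grad * C[c], grad * W[t]
                W[t] -= lr * g_w
                C[c] -= lr * g_c

# After training, the rows of W (or W + C) are used as the word embeddings.
print(W[idx["cat"]])
```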
First-order co-occurrence
A.k.a. syntagmatic association
Two words have first-order co-occurrence if they are typically nearby each other.
E.g. “wrote” is a first-order associate of “book” or “poem”.
Second-order co-occurrence
A.k.a. paradigmatic association
Two words have second-order co-occurrence if they have similar neighbours.
E.g. “wrote” is a second-order associate of words like “said” or “remarked”.