Words and meaning Flashcards
Lemma
Is the base form of a word that is used to represent multiple inflected forms of the word.
Hyponym
One word is a hyponym of another if the first has a more specific sense; for example, dog is a hyponym of animal.
Distributional semantics, basic idea and basic approach
Distributional semantics is a subfield of NLP that develops methods for quantifying semantic similarities between words based on their distributional properties, that is, the contexts (neighboring words) in which they occur.
- The BASIC IDEA lies in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.
- The BASIC APPROACH is to collect distributional information in high-dimensional vectors, and to define distributional/semantic similarity in terms of vector similarity.
Vector normalization and cosine similarity.
Vector normalization of v:
- compute the norm of v: |v|
- compute w = v/|v|
Cosine similarity:
cosine(v, w) = (v · w) / (|v| |w|)
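A minimal NumPy sketch of both operations (the function names are my own):
```python
import numpy as np

def normalize(v):
    # w = v / |v|: divide v by its Euclidean norm so it has unit length.
    return v / np.linalg.norm(v)

def cosine(v, w):
    # cosine(v, w) = (v . w) / (|v| |w|)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1.0, 2.0, 3.0])
w = np.array([2.0, 4.0, 6.0])
print(normalize(v))  # unit-length vector in the direction of v
print(cosine(v, w))  # 1.0, since v and w point in the same direction
```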
Vector semantics and the 2 families of word embeddings
Is the approach of representing words as embedding vectors; it is the standard way to represent word meaning in NLP.
Two families of word embeddings:
- Sparse vectors: vector components are computed through some function of the counts of nearby words.
- Dense vectors: vector components are computed through some optimisation or approximation process.
Term-context matrix
For each term (row) and each context word (column), count how many times the context word appears within a fixed-size window around the term.
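A toy sketch of building these counts with a symmetric window (the corpus, window size, and function name are illustrative assumptions):
```python
from collections import defaultdict

def term_context_counts(tokens, window=2):
    # counts[(t, c)] = number of times context word c appears within
    # `window` positions to the left or right of target word t.
    counts = defaultdict(int)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(target, tokens[j])] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = term_context_counts(tokens, window=2)
print(counts[("cat", "sat")])  # 1: "sat" occurs once within 2 words of "cat"
```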
Pointwise mutual information (PMI)
Is a measure of how often two events x and y occur together, compared with what we would expect if they were independent.
Give the general definition first: I(x, y) = log2( P(x, y) / (P(x) P(y)) ), and then
the PMI between a target (term) word wt and a context word wc: PMI(wt, wc) = log2( P(wt, wc) / (P(wt) P(wc)) )
- The numerator tells us how often we observed the two words together.
- The denominator tells us how often we would expect the two words to co-occur assuming they each occurred independently.
- The ratio gives us an estimate of how much more the two words co-occur than we expect by chance.
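A tiny worked example with made-up probability estimates (the numbers are illustrative, not from the slides):
```python
import math

p_t, p_c = 0.01, 0.02   # P(wt), P(wc): marginal probabilities of the two words
p_tc = 0.001            # P(wt, wc): probability of seeing them together

pmi = math.log2(p_tc / (p_t * p_c))
print(pmi)  # log2(5) ~= 2.32: the pair co-occurs about 5x more often than chance
```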
Probability estimation using word frequency, why PPMI?
- compute P(w)
- compute P(wt, wc) (careful to define C(wt, wc), remember the constant L)
- define the PPMI
Why PPMI?
- reliably estimating negative PMI values would require an enormous corpus
- if the two words never occur together, PMI = −∞
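A minimal sketch of the PPMI computation from a count matrix, assuming the probabilities are estimated by dividing counts by the total count (which may be what the constant L above refers to):
```python
import numpy as np

def ppmi_matrix(C):
    # C[i, j] = count of context word j within the window of target word i.
    total = C.sum()                                 # normalizing constant
    p_tc = C / total                                # P(wt, wc)
    p_t = C.sum(axis=1, keepdims=True) / total      # P(wt)
    p_c = C.sum(axis=0, keepdims=True) / total      # P(wc)
    with np.errstate(divide="ignore"):
        pmi = np.log2(p_tc / (p_t * p_c))
    return np.maximum(pmi, 0)                       # PPMI = max(PMI, 0)

C = np.array([[4.0, 0.0, 1.0],
              [1.0, 2.0, 0.0]])
print(ppmi_matrix(C))  # zero counts give -inf PMI, clipped to 0 by the max
```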
Exercise on computing PPMI term-context matrix
Slide 32-33 pdf 4…
We can use the rows of the PPMI term-context matrix as word embeddings.
Notice that these vectors:
- have the size of the vocabulary, which can be quite large
- when viewed as arrays, they are very sparse
Bias of the PMI
PMI has the problem of being biased toward infrequent events: very rare words tend to have very high PMI values.
One way to reduce this bias is to slightly change the computation for P(wc) in the PPMI (hint: use α).
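A sketch of the α-smoothed context probability estimate, Pα(wc) = C(wc)^α / Σw C(w)^α, assuming the common choice α = 0.75 (the exact formula is only hinted at above):
```python
import numpy as np

def smoothed_context_probs(context_counts, alpha=0.75):
    # Raising counts to alpha < 1 gives rare context words a slightly larger
    # probability, which lowers their PPMI and reduces the bias toward rare events.
    weighted = context_counts ** alpha
    return weighted / weighted.sum()

counts = np.array([1000.0, 10.0, 1.0])
print(counts / counts.sum())               # unsmoothed P(wc)
print(smoothed_context_probs(counts))      # rare words gain probability mass
```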
Truncated singular value decomposition technique
Is a matrix approximation technique for obtaining dense word embeddings from the PPMI term-context matrix.
Let P be the matrix to approximate and U, V learnable parameters.
- min_{U,V} ||P − P(U,V)||_F (remember the Frobenius norm)
- min_{U,S,V} ||P − U S Vᵀ||_F
U represents the target embeddings
S Vᵀ represents the context embeddings
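A minimal NumPy sketch of truncated SVD on a stand-in matrix (the function name is my own; the split of U vs. S·Vᵀ into target/context embeddings follows the card above):
```python
import numpy as np

def truncated_svd_embeddings(P, k):
    # Full SVD P = U S Vt, then keep only the k largest singular values.
    U, S, Vt = np.linalg.svd(P, full_matrices=False)
    U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]
    target_embeddings = U_k                      # rows are dense k-dim word vectors
    context_embeddings = np.diag(S_k) @ Vt_k     # S Vt, as in the card above
    return target_embeddings, context_embeddings

P = np.random.rand(6, 8)       # stand-in for a small PPMI term-context matrix
W, C = truncated_svd_embeddings(P, k=3)
print(W.shape, C.shape)        # (6, 3) and (3, 8)
```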
Word2vec, idea and skip-gram with negative sampling
Word2vec is a software package including two different algorithms for learning word embeddings:
- skip-gram with negative sampling (SGNS)
- continuous bag-of-words (CBOW)
IDEA:
We train a classifier on the following binary prediction task:
- Is a given context word likely to appear near a given target word?
We don’t really care about this prediction task: instead, we use the learned parameters as the word embeddings.
Skip-gram algorithm
A static neural word embedding method (each word gets a single embedding, regardless of context).
- For each target word wt in the vocabulary V:
- treat wt and any neighboring context word wc as positive examples
- randomly sample other words wn in V, called noise words, to produce negative examples for wt (see the sampling sketch after this card)
- use logistic regression to train a classifier to distinguish positive and negative examples
- use the learned weights as the embeddings
example slide 51 pdf 4…
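A toy sketch of how the positive and negative examples could be generated (uniform noise sampling and all names are simplifying assumptions; word2vec actually samples noise words from a weighted unigram distribution):
```python
import random

def sgns_examples(tokens, window=2, k=2, seed=0):
    # For every (target, context) pair inside the window emit one positive
    # example and k negative examples built from randomly sampled noise words.
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            examples.append((target, tokens[j], 1))              # positive example
            for _ in range(k):
                examples.append((target, rng.choice(vocab), 0))  # negative example
    return examples

print(sgns_examples("the cat sat on the mat".split())[:6])
```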
Logistic regression in the skip-gram algorithm
- we need to estimate the probabilities P(+|wt, u) and P(-|wt,u)
- for each word w in V construct two complementary embeddings:
- target embeddings et(w)
- context embeddings ec(w)
- define P(+|wt, u) = σ(et(wt) · ec(u)) and P(−|wt, u) = 1 − P(+|wt, u)
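A minimal sketch of the two probabilities, assuming the embeddings are plain NumPy vectors:
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_positive(e_target, e_context):
    # P(+ | wt, u) = sigma(et(wt) . ec(u)); P(- | wt, u) = 1 - P(+ | wt, u)
    return sigmoid(np.dot(e_target, e_context))

et_w = np.array([0.2, -0.1, 0.4])   # target embedding et(wt)
ec_u = np.array([0.3, 0.0, 0.5])    # context embedding ec(u)
print(p_positive(et_w, ec_u))       # close to 1 when the two vectors are similar
```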
Training in the skip-gram algorithm
For simplicity, let us consider a dataset with only one target/context pair (w, u) along with k noise words v1, v2, . . . , vk (negative examples).
Skip-gram makes the simplifying assumption that all (positive and negative) context words are independent.
- Maximize the log-likelihood LLw = log[ P(+|w, u) · Π_{i=1..k} P(−|w, vi) ]. After each update of the parameters we have:
- an increase in similarity (dot product) between et(w) and ec(u)
- a decrease in similarity between et(w) and ec(vi), for all of the noise words vi
- We train the model with stochastic gradient descent, as usual for logistic regression (see the update sketch after this card).
- We retain the target embeddings et(w) and ignore the context embeddings ec(w).
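A sketch of one stochastic-gradient-ascent step on the objective above, for a single (w, u) pair with k noise words (the learning rate and random initialization are arbitrary assumptions):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(e_w, e_u, e_noise, lr=0.1):
    # Maximize LLw = log sigma(e_w . e_u) + sum_i log sigma(-e_w . e_noise[i]).
    g_pos = 1.0 - sigmoid(e_w @ e_u)     # gradient weight for the positive pair
    grad_w = g_pos * e_u
    e_u += lr * g_pos * e_w              # pull et(w) and ec(u) together
    for e_v in e_noise:
        g_neg = sigmoid(e_w @ e_v)       # gradient weight for a noise word
        grad_w -= g_neg * e_v
        e_v -= lr * g_neg * e_w          # push et(w) and ec(vi) apart
    e_w += lr * grad_w
    return e_w, e_u, e_noise

rng = np.random.default_rng(0)
e_w, e_u = rng.normal(size=4), rng.normal(size=4)
e_noise = [rng.normal(size=4) for _ in range(2)]
before = e_w @ e_u
e_w, e_u, e_noise = sgns_step(e_w, e_u, e_noise)
print(before, e_w @ e_u)   # the dot product with the positive context typically grows
```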