Words and meaning Flashcards
Lemma
Is the base form of a word that is used to represent multiple inflected forms of the word.
Hyponym
One word is a hyponym of another if the first has a more specific sense; for example, dog is a hyponym of animal.
Distributional semantics, basic idea and basic approach
Distributional semantics is a subfield of NLP that develops methods for quantifying semantic similarities between words based on their distributional properties, that is, the contexts (neighboring words) in which they occur.
- The BASIC IDEA lies in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.
- The BASIC APPROACH is to collect distributional information in high-dimensional vectors, and to define distributional/semantic similarity in terms of vector similarity.
Vector normalization and cosine similarity.
Vector normalization of v:
- compute the norm of v: |v|
- compute w = v/|v|
Cosine similarity:
cosine(v, w) = (v · w) / (|v| |w|)
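A minimal NumPy sketch of both operations (the function names are my own):
```python
import numpy as np

def normalize(v):
    # w = v / |v|: divide v by its Euclidean norm so it has unit length.
    return v / np.linalg.norm(v)

def cosine(v, w):
    # cosine(v, w) = (v . w) / (|v| |w|)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1.0, 2.0, 3.0])
w = np.array([2.0, 4.0, 6.0])
print(normalize(v))  # unit-length vector in the direction of v
print(cosine(v, w))  # 1.0, since v and w point in the same direction
```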
Vector semantics and the 2 families of word embeddings
Is the approach of representing words as embedding vectors; it is the standard way to represent word meaning in NLP.
Two families of word embeddings:
- Sparse vectors: vector components are computed through some function of the counts of nearby words.
- Dense vectors: vector components are computed through some optimisation or approximation process.
Term-context matrix
For each term (row) and each context word (column), count how many times the context word appears within a fixed-size window around the term.
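A toy sketch of building these counts with a symmetric window (the corpus, window size, and function name are illustrative assumptions):
```python
from collections import defaultdict

def term_context_counts(tokens, window=2):
    # counts[(t, c)] = number of times context word c appears within
    # `window` positions to the left or right of target word t.
    counts = defaultdict(int)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(target, tokens[j])] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = term_context_counts(tokens, window=2)
print(counts[("cat", "sat")])  # 1: "sat" occurs once within 2 words of "cat"
```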
Pointwise mutual information (PMI)
Is a measure of how often two events x and y occur together, compared with what we would expect if they were independent.
Give the general definition first: I(x, y) = log2( P(x, y) / (P(x) P(y)) ), and then
the PMI between a target (term) word wt and a context word wc: PMI(wt, wc) = log2( P(wt, wc) / (P(wt) P(wc)) )
- The numerator tells us how often we observed the two words together.
- The denominator tells us how often we would expect the two words to co-occur assuming they each occurred independently.
- The ratio gives us an estimate of how much more the two words co-occur than we expect by chance.
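A tiny worked example with made-up probability estimates (the numbers are illustrative, not from the slides):
```python
import math

p_t, p_c = 0.01, 0.02   # P(wt), P(wc): marginal probabilities of the two words
p_tc = 0.001            # P(wt, wc): probability of seeing them together

pmi = math.log2(p_tc / (p_t * p_c))
print(pmi)  # log2(5) ~= 2.32: the pair co-occurs about 5x more often than chance
```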
Probability estimation using word frequency, why PPMI?
- compute P(w)
- compute P(wt, wc) (careful to define C(wt, wc), remember the constant L)
- define the PPMI
Why PPMI?
- reliably estimating negative PMI values would require an enormous corpus
- if the two words never occur together, PMI = −∞
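A minimal sketch of the PPMI computation from a count matrix, assuming the probabilities are estimated by dividing counts by the total count (which may be what the constant L above refers to):
```python
import numpy as np

def ppmi_matrix(C):
    # C[i, j] = count of context word j within the window of target word i.
    total = C.sum()                                 # normalizing constant
    p_tc = C / total                                # P(wt, wc)
    p_t = C.sum(axis=1, keepdims=True) / total      # P(wt)
    p_c = C.sum(axis=0, keepdims=True) / total      # P(wc)
    with np.errstate(divide="ignore"):
        pmi = np.log2(p_tc / (p_t * p_c))
    return np.maximum(pmi, 0)                       # PPMI = max(PMI, 0)

C = np.array([[4.0, 0.0, 1.0],
              [1.0, 2.0, 0.0]])
print(ppmi_matrix(C))  # zero counts give -inf PMI, clipped to 0 by the max
```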
Exercise on computing PPMI term-context matrix
Slide 32-33 pdf 4…
We can use the rows of the PPMI term-context matrix as word embeddings.
Notice that these vectors:
- have the size of the vocabulary, which can be quite large
- when viewed as arrays, they are very sparse
Bias of the PMI
PMI has the problem of being biased toward infrequent events: very rare words tend to have very high PMI values.
One way to reduce this bias is to slightly change the computation for P(wc) in the PPMI (hint: use α).
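A sketch of the α-smoothed context probability estimate, Pα(wc) = C(wc)^α / Σw C(w)^α, assuming the common choice α = 0.75 (the exact formula is only hinted at above):
```python
import numpy as np

def smoothed_context_probs(context_counts, alpha=0.75):
    # Raising counts to alpha < 1 gives rare context words a slightly larger
    # probability, which lowers their PPMI and reduces the bias toward rare events.
    weighted = context_counts ** alpha
    return weighted / weighted.sum()

counts = np.array([1000.0, 10.0, 1.0])
print(counts / counts.sum())               # unsmoothed P(wc)
print(smoothed_context_probs(counts))      # rare words gain probability mass
```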
Truncated singular value decomposition technique
Is a matrix approximation technique for obtaining dense word embeddings from the PPMI term-context matrix.
Let P be the matrix to approximate and U, V learnable parameters.
- min_{U,V} ||P − P(U,V)||_F (remember the Frobenius norm)
- min_{U,S,V} ||P − U S Vᵀ||_F
U represents the target embeddings
S Vᵀ represents the context embeddings
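A minimal NumPy sketch of truncated SVD on a stand-in matrix (the function name is my own; the split of U vs. S·Vᵀ into target/context embeddings follows the card above):
```python
import numpy as np

def truncated_svd_embeddings(P, k):
    # Full SVD P = U S Vt, then keep only the k largest singular values.
    U, S, Vt = np.linalg.svd(P, full_matrices=False)
    U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]
    target_embeddings = U_k                      # rows are dense k-dim word vectors
    context_embeddings = np.diag(S_k) @ Vt_k     # S Vt, as in the card above
    return target_embeddings, context_embeddings

P = np.random.rand(6, 8)       # stand-in for a small PPMI term-context matrix
W, C = truncated_svd_embeddings(P, k=3)
print(W.shape, C.shape)        # (6, 3) and (3, 8)
```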
Word2vec, idea and skip-gram with negative sampling
Word2vec is a software package including two different algorithms for learning word embeddings:
- skip-gram with negative sampling (SGNS)
- continuous bag-of-words (CBOW)
IDEA:
We train a classifier on the following binary prediction task:
- Is a given context word likely to appear near a given target word?
We don’t really care about this prediction task: instead, we use the learned parameters as the word embeddings.
Skip-gram algorithm
A static neural word embedding method (each word gets a single embedding, regardless of context).
- For each target word wt in the vocabulary V:
- treat wt and any neighboring context word wc as positive examples
- randomly sample other words wn in V, called noise words, to produce negative examples for wt (see the sampling sketch after this card)
- use logistic regression to train a classifier to distinguish positive and negative examples
- use the learned weights as the embeddings
example slide 51 pdf 4…
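A toy sketch of how the positive and negative examples could be generated (uniform noise sampling and all names are simplifying assumptions; word2vec actually samples noise words from a weighted unigram distribution):
```python
import random

def sgns_examples(tokens, window=2, k=2, seed=0):
    # For every (target, context) pair inside the window emit one positive
    # example and k negative examples built from randomly sampled noise words.
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            examples.append((target, tokens[j], 1))              # positive example
            for _ in range(k):
                examples.append((target, rng.choice(vocab), 0))  # negative example
    return examples

print(sgns_examples("the cat sat on the mat".split())[:6])
```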
Logistic regression in the skip-gram algorithm
- we need to estimate the probabilities P(+|wt, u) and P(-|wt,u)
- for each word w in V construct two complementary embeddings:
- target embeddings et(w)
- context embeddings ec(w)
- define P(+|wt, u) = σ(et(wt) · ec(u)) and P(−|wt, u) = 1 − P(+|wt, u)
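A minimal sketch of the two probabilities, assuming the embeddings are plain NumPy vectors:
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_positive(e_target, e_context):
    # P(+ | wt, u) = sigma(et(wt) . ec(u)); P(- | wt, u) = 1 - P(+ | wt, u)
    return sigmoid(np.dot(e_target, e_context))

et_w = np.array([0.2, -0.1, 0.4])   # target embedding et(wt)
ec_u = np.array([0.3, 0.0, 0.5])    # context embedding ec(u)
print(p_positive(et_w, ec_u))       # close to 1 when the two vectors are similar
```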
Training in the skip-gram algorithm
For simplicity, let us consider a dataset with only one target/context pair (w, u) along with k noise words v1, v2, . . . , vk (negative examples).
Skip-gram makes the simplifying assumption that all (positive and negative) context words are independent.
- Maximize the log-likelihood LLw = log[ P(+|w, u) · Π_{i=1..k} P(−|w, vi) ]. After each update of the parameters we have:
- an increase in similarity (dot product) between et(w) and ec(u)
- a decrease in similarity between et(w) and ec(vi), for all of the noise words vi
- We train the model with stochastic gradient descent, as usual for logistic regression (see the update sketch after this card).
- We retain the target embeddings et(w) and ignore the context embeddings ec(w).
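A sketch of one stochastic-gradient-ascent step on the objective above, for a single (w, u) pair with k noise words (the learning rate and random initialization are arbitrary assumptions):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(e_w, e_u, e_noise, lr=0.1):
    # Maximize LLw = log sigma(e_w . e_u) + sum_i log sigma(-e_w . e_noise[i]).
    g_pos = 1.0 - sigmoid(e_w @ e_u)     # gradient weight for the positive pair
    grad_w = g_pos * e_u
    e_u += lr * g_pos * e_w              # pull et(w) and ec(u) together
    for e_v in e_noise:
        g_neg = sigmoid(e_w @ e_v)       # gradient weight for a noise word
        grad_w -= g_neg * e_v
        e_v -= lr * g_neg * e_w          # push et(w) and ec(vi) apart
    e_w += lr * grad_w
    return e_w, e_u, e_noise

rng = np.random.default_rng(0)
e_w, e_u = rng.normal(size=4), rng.normal(size=4)
e_noise = [rng.normal(size=4) for _ in range(2)]
before = e_w @ e_u
e_w, e_u, e_noise = sgns_step(e_w, e_u, e_noise)
print(before, e_w @ e_u)   # the dot product with the positive context typically grows
```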