Words and meaning Flashcards

1
Q

Lemma

A

The base form of a word, used to represent its multiple inflected forms.

2
Q

Hyponym

A

A word is a hyponym of another if the first has a more specific sense than the second.

3
Q

Distributional semantics, basic idea and basic approach

A

Distributional semantics is a subfield of NLP that develops methods for quantifying semantic similarity between words based on their distributional properties, i.e. the words that occur near them.

  • The BASIC IDEA lies in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.
  • The BASIC APPROACH is to collect distributional information in high-dimensional vectors, and to define distributional/semantic similarity in terms of vector similarity.
4
Q

Vector normalization and cosine similarity.

A

Vector normalization of v:

  1. compute the norm of v: |v|
  2. compute w = v/|v|

Cosine similarity:

cosine(v, w) = (v · w) / (|v| |w|)
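
A minimal numpy sketch of both operations (the toy vectors are made up for illustration):

  import numpy as np

  v = np.array([1.0, 2.0, 2.0])
  w = np.array([2.0, 0.0, 1.0])

  # 1. vector normalization: divide v by its Euclidean norm |v|
  v_unit = v / np.linalg.norm(v)   # |v_unit| == 1

  # 2. cosine similarity: dot product divided by the product of the norms
  cosine = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))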

5
Q

Vector semantics and the 2 families of word embeddings

A

Vector semantics is the approach of representing words as embedding vectors, and it is the standard way to represent word meaning in NLP.

Two families of word embeddings:

  • Sparse vectors: vector components are computed through some function of the counts of nearby words.
  • Dense vectors: vector components are computed through some optimisation or approximation process.
6
Q

Term-context matrix

A

For each term (row) and each context word (column), count how many times the context word appears within a fixed-size window around the term.
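
A small sketch of the counting (toy corpus and window size are made up for illustration):

  from collections import defaultdict

  corpus = ["the cat sat on the mat", "the dog sat on the rug"]
  window = 2  # consider up to 2 words on each side of the target

  counts = defaultdict(lambda: defaultdict(int))  # counts[target][context]
  for sentence in corpus:
      tokens = sentence.split()
      for i, target in enumerate(tokens):
          lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
          for j in range(lo, hi):
              if j != i:
                  counts[target][tokens[j]] += 1

  print(dict(counts["sat"]))  # co-occurrence counts of 'sat' with its neighbours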

7
Q

Pointwise mutual information (PMI)

A

A measure of how often two events x and y occur together, compared with what we would expect if they were independent.

Give the general definition first: I(x, y) = … and then:

PMI between a target (term) word wt and a context word wc…: PMI(wt, wc) = …

  • The numerator tells us how often we observed the two words together.
  • The denominator tells us how often we would expect the two words to co-occur assuming they each occurred independently.
  • The ratio gives us an estimate of how much more the two words co-occur than we expect by chance.
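
For reference, the standard textbook definitions (filled in here from the usual formulation, not copied from the slides):

  I(x, y) = log2( P(x, y) / (P(x) P(y)) )
  PMI(wt, wc) = log2( P(wt, wc) / (P(wt) P(wc)) )
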
8
Q

Probability estimation using word frequency, why PPMI?

A
  1. compute P(w)
  2. compute P(wt, wc) (careful to define C(wt, wc), remember the constant L)
  3. define the PPMI

Why PPMI?

  • reliably estimating negative PMI values would require an enormous corpus
  • if the two words never occur together, PMI = −∞
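
A common way to write these estimates (standard formulation; the exact definition of C(wt, wc) and the role of the constant L are as given in the slides):

  P(w) = C(w) / N, with N the total number of tokens
  P(wt, wc) = C(wt, wc) / (total number of target/context pairs)
  PPMI(wt, wc) = max( PMI(wt, wc), 0 )
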
9
Q

Exercise on computing the PPMI term-context matrix

A

Slides 32-33, pdf 4…

We can use the rows of the PPMI term-context matrix as word embeddings.

Notice that these vectors:

  • have the size of the vocabulary, which can be quite large
  • when viewed as arrays, they are very sparse
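
A minimal numpy sketch of the whole computation on a made-up count matrix (not the matrix from the slides):

  import numpy as np

  # toy term-context count matrix: rows = target words, columns = context words
  C = np.array([[0., 2., 1.],
                [3., 0., 1.],
                [1., 1., 0.]])

  total = C.sum()
  P_tc = C / total                             # joint probabilities P(wt, wc)
  P_t = C.sum(axis=1, keepdims=True) / total   # marginals P(wt)
  P_c = C.sum(axis=0, keepdims=True) / total   # marginals P(wc)

  with np.errstate(divide="ignore"):           # log2(0) = -inf, removed by the max below
      pmi = np.log2(P_tc / (P_t * P_c))
  ppmi = np.maximum(pmi, 0)                    # each row is a sparse word embedding
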
10
Q

Bias of the PMI

A

PMI has the problem of being biased toward infrequent events: very rare words tend to have very high PMI values.

One way to reduce this bias is to slightly change the computation for P(wc) in the PPMI (hint: use α).
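
The usual fix is to weight the context probabilities with an exponent α (α = 0.75 in the standard formulation; check the slides for the exact variant used in the course):

  Pα(wc) = C(wc)^α / (sum over c' of C(c')^α)

and PPMIα is then computed with Pα(wc) in place of P(wc) in the denominator.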

11
Q

Truncated singular value decomposition technique

A

A matrix approximation technique for obtaining dense word embeddings from the PPMI term-context matrix.

Let P be the matrix to approximate and U, V learnable parameters.

  1. min_{U,V} ||P − P(U,V)||_F (remember the Frobenius norm)
  2. min_{U,S,V} ||P − U S Vᵀ||_F

U represents the target embeddings
S*Vt represents the context embeddings
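
A minimal numpy sketch of the truncation to k dimensions (the PPMI matrix here is a random placeholder):

  import numpy as np

  P = np.random.rand(500, 1000)   # stand-in for the PPMI term-context matrix
  k = 100                         # desired embedding dimensionality

  # full SVD, then keep only the k largest singular values/vectors
  U, s, Vt = np.linalg.svd(P, full_matrices=False)
  U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

  target_embeddings = U_k                    # one k-dimensional row per target word
  context_embeddings = np.diag(s_k) @ Vt_k   # S Vᵀ, one column per context word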

12
Q

Word2vec, idea and skip-gram with negative sampling

A

Word2vec is a software package including two different algorithms for learning word embeddings:

  • skip-gram with negative sampling (SGNS)
  • continuous bag-of-words (CBOW)

IDEA:
We train a classifier on the following binary prediction task:

  • Is a given context word likely to appear near a given target word?

We don’t really care about this prediction task: instead, we use the learned parameters as the word embeddings.

13
Q

Skip-gram algorithm

A

A static neural word embedding method (each word gets a single embedding).

  1. For each target word wt in the vocabulary V:
    - treat wt and any neighboring context word wc as positive examples
    - randomly sample other words wn in V, called noise words, to produce negative examples for wt
  2. use logistic regression to train a classifier to distinguish positive and negative examples
  3. use the learned weights as the embeddings

Example: slide 51, pdf 4…
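
A rough sketch of how the positive and negative examples could be generated (toy corpus, window size and number of noise words made up; real SGNS samples noise words from a weighted unigram distribution rather than uniformly):

  import random

  tokens = "the quick brown fox jumps over the lazy dog".split()
  vocab = sorted(set(tokens))
  window, k = 2, 2   # context window size, noise words per positive example

  examples = []      # (target, context-or-noise word, label)
  for i, wt in enumerate(tokens):
      lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
      for j in range(lo, hi):
          if j == i:
              continue
          examples.append((wt, tokens[j], 1))        # positive example
          for wn in random.sample(vocab, k):         # noise words as negatives
              examples.append((wt, wn, 0))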

14
Q

Logistic regression in the skip-gram algorithm

A
  1. we need to estimate the probabilities P(+|wt, u) and P(-|wt,u)
  2. for each word w in V construct two complementary embeddings:
    - target embeddings et(w)
    - context embeddings ec(w)
  3. define P(+|wt, u) = σ(et(wt) · ec(u)) and P(−|wt, u) = 1 − P(+|wt, u)
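
Written out with the standard logistic (sigmoid) function σ(x) = 1 / (1 + exp(−x)):

  P(+|wt, u) = σ( et(wt) · ec(u) )
  P(−|wt, u) = 1 − P(+|wt, u) = σ( −et(wt) · ec(u) )
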
15
Q

Training in the skip-gram algorithm

A

For simplicity, let us consider a dataset with only one target/context pair (w, u) along with k noise words v1, v2, . . . , vk (negative examples).

Skip-gram makes the simplifying assumption that all (positive and negative) context words are independent.

  1. Maximize the log-likelihood LL(w, u) = log[ P(+|w, u) · ∏_{i=1..k} P(−|w, vi) ] = log P(+|w, u) + Σ_{i=1..k} log P(−|w, vi). After each update of the parameters we have:
    - an increase in similarity (dot product) between et(w) and ec(u)
    - a decrease in similarity between et(w) and ec(vi), for all of the noise words vi.
  2. We train the model with stochastic gradient descent, as usual for logistic regression.
  3. We retain the target embeddings et(w) and ignore the context embeddings ec(w).
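
A minimal numpy sketch of one such SGD step on a single (w, u) pair with k noise words (random toy embeddings and learning rate, not the course implementation):

  import numpy as np

  rng = np.random.default_rng(0)
  dim, k, lr = 50, 5, 0.05

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  et_w = rng.normal(scale=0.1, size=dim)            # target embedding of w
  ec_u = rng.normal(scale=0.1, size=dim)            # context embedding of u (positive)
  ec_noise = rng.normal(scale=0.1, size=(k, dim))   # context embeddings of the noise words

  # negative log-likelihood: -[ log σ(et_w·ec_u) + Σ_i log σ(-et_w·ec_noise[i]) ]
  g_pos = sigmoid(et_w @ ec_u) - 1.0                # gradient factor for the positive pair
  g_neg = sigmoid(ec_noise @ et_w)                  # gradient factors for the noise words

  # gradients w.r.t. each embedding, all computed before updating
  grad_et_w = g_pos * ec_u + g_neg @ ec_noise
  grad_ec_u = g_pos * et_w
  grad_noise = np.outer(g_neg, et_w)

  # SGD step: et_w·ec_u increases, et_w·ec_noise[i] decrease
  et_w -= lr * grad_et_w
  ec_u -= lr * grad_ec_u
  ec_noise -= lr * grad_noise
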
16
Q

Evaluation for word embeddings models

A
  • Extrinsic evaluation: uses the model to be evaluated in some end-to-end application (for instance sentiment analysis, machine translation, etc.) and measures its performance
  • Intrinsic evaluation: looks at the performance of the model in isolation, with respect to a given evaluation measure

The most common evaluation metric for embedding models is extrinsic evaluation on end-to-end tasks.