Word representations and text classification Flashcards

1
Q

Mention the two main challenges with using a naive approach of word representation, i.e. using a one-hot encoding for every word in the text corpus.

A
  • The dimension of the representation vector equals the vocabulary size, which is very high for natural vocabularies.
  • All one-hot vectors are mutually orthogonal and equidistant, so vector similarity does not reflect semantic similarity (see the sketch below).
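
A minimal numpy sketch of both problems, using a made-up four-word vocabulary:

    import numpy as np

    vocab = ["cat", "dog", "car", "truck"]   # toy vocabulary; real ones have 100k+ words
    V = len(vocab)

    # One-hot encoding: each word is a row of the V x V identity matrix.
    one_hot = np.eye(V)

    # Problem 1: the vector dimension grows with the vocabulary size.
    print(one_hot.shape)                     # (4, 4) -> (|V|, |V|) in general

    # Problem 2: every pair of distinct words has the same similarity,
    # so "cat" is no closer to "dog" than it is to "truck".
    print(one_hot @ one_hot.T)               # identity: all cross dot products are 0
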
2
Q

What is “embedding”?

A

Projecting the naive one-hot encoding of words into a dense, lower-dimensional space. This is achieved with a linear transform, i.e. Ax, where A is the embedding matrix and x is the one-hot encoded word; Ax simply selects the column of A corresponding to that word.
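
A minimal numpy sketch of this, with made-up dimensions and a random A (in practice A is learned during training):

    import numpy as np

    V, d = 10_000, 300                    # vocabulary size and embedding dimension (example values)
    A = np.random.randn(d, V) * 0.01      # embedding matrix (learned in practice, random here)

    word_index = 42                       # index of some word in the vocabulary
    x = np.zeros(V)
    x[word_index] = 1.0                   # one-hot encoding of the word

    embedding = A @ x                     # the linear transform Ax ...
    assert np.allclose(embedding, A[:, word_index])  # ... is just a column lookup in A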

3
Q

How do we quantify word similarity?

A

Words with similar contexts (similar words appearing before and after them) should have vectors that lie close to each other (e.g. as measured by cosine similarity). This is achieved by optimizing the predicted probability of each context word given our “center” word.
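
A minimal sketch of measuring closeness between already-trained word vectors with cosine similarity; the vectors below are made up for illustration:

    import numpy as np

    def cosine_similarity(u, v):
        """Cosine of the angle between two word vectors; 1 = same direction, 0 = orthogonal."""
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # Hypothetical 4-dimensional embeddings (real ones are learned and much larger).
    v_cat = np.array([0.9, 0.1, 0.3, 0.0])
    v_dog = np.array([0.8, 0.2, 0.4, 0.1])
    v_car = np.array([0.1, 0.9, 0.0, 0.7])

    print(cosine_similarity(v_cat, v_dog))   # high: similar contexts
    print(cosine_similarity(v_cat, v_car))   # lower: dissimilar contexts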

4
Q

Explicitly note the softmax function

A

softmax(z)_i = exp(z_i) / sum_k exp(z_k)
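
A minimal numpy sketch; subtracting the maximum is a standard numerical-stability trick and does not change the result:

    import numpy as np

    def softmax(z):
        """softmax(z)_i = exp(z_i) / sum_k exp(z_k), computed in a numerically stable way."""
        z = z - np.max(z)            # shifting by a constant leaves the result unchanged
        e = np.exp(z)
        return e / np.sum(e)

    print(softmax(np.array([2.0, 1.0, 0.1])))   # sums to 1; largest logit gets largest probability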

5
Q

What is the skip-gram model?

A

Input the center word; output the probability of each context word, i.e. each word within the context window around the center word (e.g. context window = 2).
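
A minimal sketch of how (center, context) training pairs could be generated for skip-gram, assuming a window of 2 words on each side of the center word; the sentence is made up:

    def skipgram_pairs(tokens, window=2):
        """Return (center, context) pairs for every word and its neighbours within the window."""
        pairs = []
        for i, center in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    pairs.append((center, tokens[j]))
        return pairs

    sentence = "the quick brown fox jumps".split()
    print(skipgram_pairs(sentence))
    # e.g. ('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps'), ...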

6
Q

What is the continuous bag of words (CBOW) model?

A

Input the context words; output the probability of the center word, i.e. the reverse of skip-gram. (Are CBOW and skip-gram equivalent when the context window = 2?)
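
A corresponding sketch of CBOW training examples, mirroring the skip-gram sketch above (same made-up sentence, window of 2 on each side):

    def cbow_examples(tokens, window=2):
        """Return (context_words, center) examples: the model predicts the center from its context."""
        examples = []
        for i, center in enumerate(tokens):
            context = [tokens[j]
                       for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                       if j != i]
            examples.append((context, center))
        return examples

    sentence = "the quick brown fox jumps".split()
    print(cbow_examples(sentence))
    # e.g. (['the', 'quick', 'fox', 'jumps'], 'brown'), ...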

7
Q

Training large word-to-vector representation models is highly computationally expensive, since we have to compute gradients of a softmax over an output whose dimension equals the vocabulary size (a few hundred thousand words). Mention three approaches that reduce this computational cost.

A
  • Hierarchical Softmax (tree structure, hard to create)
  • Noise Contrastive Estimation (E_NCE)
  • Negative sampling (E_NEG/NSL)
8
Q

Explain conceptually how Noise Contrastive Estimation (NCE) works.

A

The basic idea is to convert a multinomial classification problem (which is what predicting the next word amounts to) into a binary classification problem. That is, instead of using a softmax to estimate a true probability distribution over the output word, binary logistic regression (binary classification) is used.

For each training sample, the classifier is fed one true pair (a center word and another word that appears in its context) and k randomly corrupted pairs (consisting of the center word and a randomly chosen word from the vocabulary). By learning to distinguish the true pairs from the corrupted ones, the classifier ultimately learns the word vectors.

This is the important point: instead of predicting the next word (the “standard” training technique), the classifier simply predicts whether a pair of words is good or bad.
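
A minimal numpy sketch of this binary objective in its negative-sampling form; the vectors stand in for embeddings of the center word, a true context word, and k randomly drawn noise words, and a real implementation would update the embeddings with the gradient of this loss:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def pair_loss(v_center, v_context, negative_vectors):
        """Binary-classification loss for one true (center, context) pair and k corrupted pairs.

        The score (dot product) of the true pair is pushed up, the scores of the
        corrupted pairs are pushed down.
        """
        loss = -np.log(sigmoid(np.dot(v_center, v_context)))       # true pair labelled 1
        for v_neg in negative_vectors:                              # k noise words labelled 0
            loss -= np.log(sigmoid(-np.dot(v_center, v_neg)))
        return loss

    rng = np.random.default_rng(0)
    d, k = 50, 5                                   # embedding dim and number of negatives (examples)
    v_center = rng.normal(size=d) * 0.1
    v_context = rng.normal(size=d) * 0.1
    negatives = rng.normal(size=(k, d)) * 0.1      # embeddings of k randomly drawn vocabulary words

    print(pair_loss(v_center, v_context, negatives))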

9
Q

When are NCE and NEG/NSL equivalent?

A

When the noise distribution is the uniform distribution over the vocabulary and the number of noise samples k equals the vocabulary size.

10
Q

Explicitly note the sigmoid function

A

sigmoid(x) = 1 / (1 + exp(-x))

11
Q

Mention some other embedding levels besides word embeddings.

A
  • Character level embedding
  • Sentence level embedding
  • Universal embedding (incorporate higher level information)
  • Supervised learning with syntactic/semantic supervision