Word representations and text classification Flashcards

1
Q

Mention the two main challenges with using a naive approach of word representation, i.e. using a one-hot encoding for every word in the text corpus.

A
  • The dimension of the representation vector equals the vocabulary size, which is very high for natural vocabularies.
  • All one-hot vectors are mutually orthogonal and equidistant, so vector similarity does not reflect semantic similarity (see the sketch below).
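
A minimal numpy sketch of both problems, using a made-up four-word vocabulary:

    import numpy as np

    vocab = ["cat", "dog", "car", "truck"]   # toy vocabulary; real ones have 100k+ words
    V = len(vocab)

    # One-hot encoding: each word is a row of the V x V identity matrix.
    one_hot = np.eye(V)

    # Problem 1: the vector dimension grows with the vocabulary size.
    print(one_hot.shape)                     # (4, 4) -> (|V|, |V|) in general

    # Problem 2: every pair of distinct words has the same similarity,
    # so "cat" is no closer to "dog" than it is to "truck".
    print(one_hot @ one_hot.T)               # identity: all cross dot products are 0
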
2
Q

What is “embedding”?

A

Projecting the naive one-hot encoding of words into a dense, lower-dimensional space. This is achieved with a linear transform, i.e. Ax, where A is the embedding matrix and x is the one-hot encoded word; Ax simply selects the column of A corresponding to that word.
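
A minimal numpy sketch of this, with made-up dimensions and a random A (in practice A is learned during training):

    import numpy as np

    V, d = 10_000, 300                    # vocabulary size and embedding dimension (example values)
    A = np.random.randn(d, V) * 0.01      # embedding matrix (learned in practice, random here)

    word_index = 42                       # index of some word in the vocabulary
    x = np.zeros(V)
    x[word_index] = 1.0                   # one-hot encoding of the word

    embedding = A @ x                     # the linear transform Ax ...
    assert np.allclose(embedding, A[:, word_index])  # ... is just a column lookup in A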

3
Q

How do we quantify word similarity?

A

Words with similar contexts (similar words appearing before and after them) should have vectors that lie close to each other (e.g. as measured by cosine similarity). This is achieved by optimizing the predicted probability of each context word given our “center” word.
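
A minimal sketch of measuring closeness between already-trained word vectors with cosine similarity; the vectors below are made up for illustration:

    import numpy as np

    def cosine_similarity(u, v):
        """Cosine of the angle between two word vectors; 1 = same direction, 0 = orthogonal."""
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # Hypothetical 4-dimensional embeddings (real ones are learned and much larger).
    v_cat = np.array([0.9, 0.1, 0.3, 0.0])
    v_dog = np.array([0.8, 0.2, 0.4, 0.1])
    v_car = np.array([0.1, 0.9, 0.0, 0.7])

    print(cosine_similarity(v_cat, v_dog))   # high: similar contexts
    print(cosine_similarity(v_cat, v_car))   # lower: dissimilar contexts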

4
Q

Explicitly note the softmax function

A

softmax(z)_i = exp(z_i) / sum_k exp(z_k)
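
A minimal numpy sketch; subtracting the maximum is a standard numerical-stability trick and does not change the result:

    import numpy as np

    def softmax(z):
        """softmax(z)_i = exp(z_i) / sum_k exp(z_k), computed in a numerically stable way."""
        z = z - np.max(z)            # shifting by a constant leaves the result unchanged
        e = np.exp(z)
        return e / np.sum(e)

    print(softmax(np.array([2.0, 1.0, 0.1])))   # sums to 1; largest logit gets largest probability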

5
Q

What is the skip-gram model?

A

Input the center word; output the probability of each context word, i.e. each word within the context window around the center word (e.g. context window = 2).
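
A minimal sketch of how (center, context) training pairs could be generated for skip-gram, assuming a window of 2 words on each side of the center word; the sentence is made up:

    def skipgram_pairs(tokens, window=2):
        """Return (center, context) pairs for every word and its neighbours within the window."""
        pairs = []
        for i, center in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    pairs.append((center, tokens[j]))
        return pairs

    sentence = "the quick brown fox jumps".split()
    print(skipgram_pairs(sentence))
    # e.g. ('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps'), ...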

6
Q

What is the continuous bag of words (CBOW) model?

A

Input the context words; output the probability of the center word, i.e. the reverse of skip-gram. (Are CBOW and skip-gram equivalent when the context window = 2?)
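
A corresponding sketch of CBOW training examples, mirroring the skip-gram sketch above (same made-up sentence, window of 2 on each side):

    def cbow_examples(tokens, window=2):
        """Return (context_words, center) examples: the model predicts the center from its context."""
        examples = []
        for i, center in enumerate(tokens):
            context = [tokens[j]
                       for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                       if j != i]
            examples.append((context, center))
        return examples

    sentence = "the quick brown fox jumps".split()
    print(cbow_examples(sentence))
    # e.g. (['the', 'quick', 'fox', 'jumps'], 'brown'), ...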

7
Q

Training large word-to-vector representation models is highly computationally expensive, since we have to compute gradients of a softmax over an output whose dimension equals the vocabulary size (a few hundred thousand words). Mention three approaches that reduce this computational cost.

A
  • Hierarchical Softmax (tree structure, hard to create)
  • Noise Contrastive Estimation (E_NCE)
  • Negative sampling (E_NEG/NSL)
8
Q

Explain conceptually how Noise Contrastive Estimation (NCE) works.

A

The basic idea is to convert a multinomial classification problem (which is what predicting the next word amounts to) into a binary classification problem. That is, instead of using a softmax to estimate a true probability distribution over the output word, binary logistic regression (binary classification) is used.

For each training sample, the classifier is fed one true pair (a center word and another word that appears in its context) and k randomly corrupted pairs (consisting of the center word and a randomly chosen word from the vocabulary). By learning to distinguish the true pairs from the corrupted ones, the classifier ultimately learns the word vectors.

This is the important point: instead of predicting the next word (the “standard” training technique), the classifier simply predicts whether a pair of words is good or bad.
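
A minimal numpy sketch of this binary objective in its negative-sampling form; the vectors stand in for embeddings of the center word, a true context word, and k randomly drawn noise words, and a real implementation would update the embeddings with the gradient of this loss:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def pair_loss(v_center, v_context, negative_vectors):
        """Binary-classification loss for one true (center, context) pair and k corrupted pairs.

        The score (dot product) of the true pair is pushed up, the scores of the
        corrupted pairs are pushed down.
        """
        loss = -np.log(sigmoid(np.dot(v_center, v_context)))       # true pair labelled 1
        for v_neg in negative_vectors:                              # k noise words labelled 0
            loss -= np.log(sigmoid(-np.dot(v_center, v_neg)))
        return loss

    rng = np.random.default_rng(0)
    d, k = 50, 5                                   # embedding dim and number of negatives (examples)
    v_center = rng.normal(size=d) * 0.1
    v_context = rng.normal(size=d) * 0.1
    negatives = rng.normal(size=(k, d)) * 0.1      # embeddings of k randomly drawn vocabulary words

    print(pair_loss(v_center, v_context, negatives))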

9
Q

When are NCE and NEG/NSL equivalent?

A

When the noise distribution is the uniform distribution over the vocabulary and the number of noise samples k equals the vocabulary size.

10
Q

Explicitly note the sigmoid function

A

sigmoid(x) = 1 / (1 + exp(-x))

11
Q

Mention some other embedding levels besides word embeddings.

A
  • Character level embedding
  • Sentence level embedding
  • Universal embedding (incorporate higher level information)
  • Supervised learning with syntactic/semantic supervision