2. Word Embeddings Flashcards

1
Q

Describe the Bag-of-Words (BoW) representation.

A

We create a zero-filled vector whose length equals the vocabulary size, where every index corresponds to a word, and set an index to 1 wherever the corresponding word appears in our sentence.
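
A minimal sketch of this, assuming a toy vocabulary and a binary (presence/absence) encoding:

```python
# Minimal Bag-of-Words sketch: binary presence/absence over a toy vocabulary.
vocabulary = ["good", "movie", "great", "actors", "bad", "music"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def bag_of_words(sentence):
    vector = [0] * len(vocabulary)          # zero-filled array of length(vocabulary)
    for word in sentence.lower().split():
        if word in word_to_index:           # set the index to 1 where the word occurs
            vector[word_to_index[word]] = 1
    return vector

print(bag_of_words("good movie great actors"))  # [1, 1, 1, 1, 0, 0]
```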

2
Q

Give some ways of representing words.

A
  • Words themselves
  • Mapping to a lexical database
  • Co-occurrence vectors
  • Features of words
  • Word embeddings
3
Q

Give an example of mapping of words to concepts. What are the issues with this representation?

A

Cat & kitten → feline mammal
Mouse & rat → rodent mammal

Issues:
It misses nuances and new meanings, finding a good mapping is a hard problem, and the required resources (lexical databases) are scarce.
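
The card doesn't name a tool, but one common way to do this kind of mapping is a lexical database such as WordNet. A rough sketch using NLTK's WordNet interface (assumes nltk is installed and the wordnet corpus has been downloaded):

```python
# Rough sketch: mapping words up to shared concepts via WordNet (through NLTK).
# Assumes: pip install nltk  and  nltk.download("wordnet") have been run.
from nltk.corpus import wordnet as wn

for word in ["cat", "kitten", "mouse", "rat"]:
    synset = wn.synsets(word, pos=wn.NOUN)[0]            # first noun sense only
    hypernyms = [h.name() for h in synset.hypernyms()]   # the concept(s) one level up
    print(f"{word:7s} -> {hypernyms}")
# Related words should lead up to shared or related concepts (e.g. feline / rodent mammals),
# but picking the right sense is hard and newer meanings are missing from the database.
```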

4
Q

Give an example of features of words representations. What are the issues with this representation?

A

Example:
Counting positive/negative words in a sentence.
"Good movie, great actors, bad music": 2 positive, 1 negative.

Issues:
It is task-specific and requires knowledge of our domain.
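
A minimal sketch of such a feature extractor; the positive/negative word lists are toy assumptions, and a real system needs a domain-specific lexicon:

```python
# Sketch of a hand-crafted feature: counts of positive and negative words.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "awful", "boring"}

def sentiment_counts(sentence):
    tokens = sentence.lower().replace(",", " ").split()
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives, negatives

print(sentiment_counts("Good movie, great actors, bad music"))  # (2, 1)
```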

5
Q

What is a word embedding?

A

A word embedding represents (embeds) words in a continuous vector space where semantically similar words are mapped to nearby points (i.e. are embedded near each other).
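
A small sketch of what "nearby" means in practice, using made-up 2-D vectors and cosine similarity (the vectors are illustrative assumptions, not real embeddings):

```python
import numpy as np

# Made-up 2-D "embeddings": cat and kitten point in similar directions, car does not.
embeddings = {
    "cat":    np.array([0.9, 0.8]),
    "kitten": np.array([0.85, 0.75]),
    "car":    np.array([-0.7, 0.1]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))  # close to 1
print(cosine_similarity(embeddings["cat"], embeddings["car"]))     # much lower
```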

6
Q

Describe count-based embeddings.

A

We create a symmetric word-by-word matrix filled with zeros and place a 1 (or a co-occurrence count) in every cell whose row and column words appear together in a sentence. Each row is then a vector for a word, and the relationship between two words can be measured with a metric such as the L2 distance between their vectors.
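
A minimal sketch with a toy corpus (the sentences are made up for illustration):

```python
import numpy as np
from itertools import combinations

# Sketch of count-based embeddings: a sentence-level co-occurrence matrix.
sentences = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "kitten", "chased", "the", "mouse"],
    ["i", "parked", "the", "car"],
]

vocab = sorted({word for sentence in sentences for word in sentence})
index = {word: i for i, word in enumerate(vocab)}
matrix = np.zeros((len(vocab), len(vocab)))

for sentence in sentences:
    for w1, w2 in combinations(set(sentence), 2):   # each unordered word pair in the sentence
        matrix[index[w1], index[w2]] += 1           # count the co-occurrence...
        matrix[index[w2], index[w1]] += 1           # ...symmetrically

# Compare two rows with the L2 distance: smaller means more similar contexts.
def l2(word_a, word_b):
    return float(np.linalg.norm(matrix[index[word_a]] - matrix[index[word_b]]))

print(l2("cat", "kitten"), l2("cat", "car"))   # "cat" is closer to "kitten" than to "car"
```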

7
Q

What are stop words?

A

Stop words are the most common words in a language (e.g. "the", "a", "is"). We usually opt to remove them during the pre-processing stage of our data, since they carry little information for most tasks.
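
A minimal sketch, with a toy stop-word list as an assumption (libraries such as NLTK and spaCy ship curated lists):

```python
# Sketch of stop-word removal during pre-processing (toy stop-word list).
STOP_WORDS = {"the", "a", "an", "is", "of", "and"}

def remove_stop_words(tokens):
    return [token for token in tokens if token.lower() not in STOP_WORDS]

print(remove_stop_words("the cat is on the mat".split()))  # ['cat', 'on', 'mat']
```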

8
Q

Describe learning word representations during training.

A

○ Models similarity between words in corpora
○ Instead of counting co-occurrences, predicts words in context
○ Computationally efficient
○ Possible to add new words to the model: it scales with corpus size

9
Q

What is the difference between a Continuous Bag of Words (CBoW) model and a Skip-gram model?

A

CBOW: predicts the target word w_t from its surrounding context words.

Skip-gram: predicts the context words from the target word w_t.
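
A small sketch of how the training pairs differ for the two models, assuming a window of ±2 around the target word:

```python
# For the sentence below and target index 2 ("sat"), with window size c = 2:
sentence = ["the", "cat", "sat", "on", "the", "mat"]
t, c = 2, 2
context = sentence[t - c:t] + sentence[t + 1:t + 1 + c]

# CBOW: one training example, context words -> target word
cbow_example = (context, sentence[t])          # (['the', 'cat', 'on', 'the'], 'sat')

# Skip-gram: one training example per context word, target word -> context word
skipgram_examples = [(sentence[t], w) for w in context]
# [('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]

print(cbow_example)
print(skipgram_examples)
```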

10
Q

What is the goal of the Skip-gram model? Give its formula and define its probability.

A

The goal of the skip-gram model is to find word representations that are useful for predicting the context words w_{t+j} given a word w_t, by maximising the average log probability over a context window of size c:

(1/T) * Sum_t Sum_{-c <= j <= c, j != 0} log p(w_{t+j} | w_t ; θ)

The probability p is a softmax: the numerator is the exp() of the dot product between the context word's and the target word's representations, and the denominator is the sum, over all words in the vocabulary, of the exp() of the dot product between each word's representation and the target word's representation.
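
Written out explicitly (following the standard word2vec formulation, where v and v' denote the "input" and "output" vector representations and W is the vocabulary size):

```latex
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t;\theta),
\qquad
p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}
                       {\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}
```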

11
Q

How can we get a matrix of word embeddings of our input?

A

We set up a neural network with one hidden layer with no activation function and one output (softmax) layer. We then train the network on one-hot encodings of our words to predict the probability of each word's context words at the output layer. Once training is done, we discard the output layer; the input-to-hidden weight matrix is our matrix of word embeddings, with one row per word.
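
A minimal numpy sketch of this architecture (vocabulary size, embedding dimension, and the training loop are illustrative assumptions; only the forward pass is shown):

```python
import numpy as np

V, D = 10, 4                      # vocabulary size and embedding dimension (toy values)
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))    # hidden-layer weights = the embeddings
W_out = rng.normal(scale=0.1, size=(D, V))   # output layer, discarded after training

def forward(target_index):
    one_hot = np.zeros(V)
    one_hot[target_index] = 1.0
    hidden = one_hot @ W_in                  # no activation: just selects row target_index
    scores = hidden @ W_out
    return np.exp(scores) / np.exp(scores).sum()   # softmax over context-word probabilities

probs = forward(3)                # probability of each vocabulary word being a context word
embedding_matrix = W_in           # after training, row i of W_in is the embedding of word i
print(probs.shape, embedding_matrix.shape)   # (10,) (10, 4)
```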

12
Q

What is negative sampling?

A

Training a skip-gram neural network involves a tremendous number of weights, all of which would be updated slightly by every one of our billions of training samples.

Negative sampling addresses this by having each training sample modify only a small percentage of the weights, rather than all of them.

Instead of updating the output weights for every word in the vocabulary, we randomly choose a small number of negative sample words (not related to the target word) and only update the weights for them and for the positive (actual context) word.

5-20 negative words works well for smaller datasets; you can get away with only 2-5 for large datasets.
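
A rough numpy sketch of one negative-sampling update (the dimensions, learning rate, and uniform sampling are simplified assumptions; word2vec actually draws negatives from a smoothed unigram distribution):

```python
import numpy as np

V, D, K, lr = 10_000, 100, 5, 0.025          # vocab size, dim, negatives per sample, learning rate
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))    # target-word (input) vectors
W_out = rng.normal(scale=0.1, size=(V, D))   # context-word (output) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(target, context):
    negatives = rng.integers(0, V, size=K)         # simplified: uniform negative sampling
    samples = np.concatenate(([context], negatives))
    labels = np.array([1.0] + [0.0] * K)           # 1 for the real context word, 0 for negatives

    v_t = W_in[target]                             # only K+1 output rows and 1 input row
    scores = sigmoid(W_out[samples] @ v_t)         # are touched, not the whole V x D matrices
    errors = scores - labels

    W_in[target] -= lr * (errors @ W_out[samples])
    W_out[samples] -= lr * np.outer(errors, v_t)

train_pair(target=42, context=7)
```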

13
Q

What is Glove (Global Vectors for Word Representation)?

A

GloVe is a collection of pre-trained word vectors (trained on global word-word co-occurrence statistics) which we can use in our deep learning models as embeddings.
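
A small sketch of loading a pre-trained GloVe file into a dictionary (the file name glove.6B.100d.txt is one of the downloads offered on the Stanford GloVe page; treat the local path as an assumption):

```python
import numpy as np

# Load pre-trained GloVe vectors from a whitespace-separated text file:
# each line is "<word> <float> <float> ... <float>".
def load_glove(path):
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

glove = load_glove("glove.6B.100d.txt")   # assumed local path to the downloaded file
print(glove["cat"].shape)                 # (100,)
```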
