2. Word Embeddings Flashcards
Describe the Bag-of-Words (BoW) representation.
We create a zero-filled vector whose length equals the vocabulary size, where each index corresponds to a word in the vocabulary, and set an index to 1 wherever the corresponding word appears in our sentence.
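A minimal sketch of this idea in Python (the vocabulary and sentence below are invented purely for illustration):

```python
# Minimal binary Bag-of-Words sketch (toy vocabulary for illustration).
vocabulary = ["the", "cat", "sat", "on", "mat", "dog"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def bag_of_words(sentence):
    vector = [0] * len(vocabulary)           # zero-filled array of length |V|
    for word in sentence.lower().split():
        if word in word_to_index:
            vector[word_to_index[word]] = 1  # mark the word as present
    return vector

print(bag_of_words("The cat sat on the mat"))  # [1, 1, 1, 1, 1, 0]
```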
Give some ways of representing words.
- Words themselves
- Mapping to Lexical Database
- Co-occurrence vectors
- Features of words
- Word Embeddings
Give an example of mapping of words to concepts. What are the issues with this representation?
Cat & kitten → feline mammal
Mouse & rat → rodent mammal
Issues:
It misses nuances and new meanings, finding the mapping is a hard problem, and lexical resources are scarce.
Give an example of features of words representations. What are the issues with this representation?
Example:
Counting positive/negative words in a sentence (see the sketch below).
Good movie, great actors, bad music: 2+, 1-
Issues:
It is task specific and requires knowledge of our domain.
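A toy sketch of such a hand-crafted feature extractor, assuming small illustrative word lists (real sentiment lexicons are much larger):

```python
# Toy feature extractor: counts of positive and negative words.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}

def sentiment_counts(sentence):
    tokens = sentence.lower().replace(",", " ").split()
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives, negatives

print(sentiment_counts("Good movie, great actors, bad music"))  # (2, 1)
```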
What is a word embedding?
A word embedding represents (embeds) words in a continuous vector space where semantically similar words are mapped to nearby points (i.e. are embedded near each other).
Describe count-based embeddings.
We create a symmetric matrix of zeros with one row and column per word, and mark (or increment) the cell for every pair of words that appear together in a sentence. Each row is then a vector for that word, and the relationship between two words can be measured with a metric such as the L2 distance between their vectors.
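A small numpy sketch of this, using a made-up toy corpus and incrementing counts rather than placing a single 1 (a common variant):

```python
import numpy as np

# Count-based (co-occurrence) embeddings over a toy corpus.
corpus = ["the cat chased the mouse", "the dog chased the cat", "the mouse ate cheese"]
vocab = sorted({w for sentence in corpus for w in sentence.split()})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric matrix: bump a cell whenever two words appear in the same sentence.
cooc = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    words = sentence.split()
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            if w1 != w2:
                cooc[index[w1], index[w2]] += 1
                cooc[index[w2], index[w1]] += 1

# Compare two words via the L2 distance between their co-occurrence rows.
distance = np.linalg.norm(cooc[index["cat"]] - cooc[index["dog"]])
print(f"L2 distance between 'cat' and 'dog': {distance:.2f}")
```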
What are stop words?
Stop words refer to the most common words in a language. We usually opt to remove them during the pre-processing stage of our data.
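A tiny sketch of this pre-processing step; the stop-word list here is a small illustrative subset (real lists, e.g. NLTK's, contain a few hundred entries):

```python
# Remove stop words from a sentence using a toy stop-word set.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on"}

def remove_stop_words(sentence):
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The cat is on the mat"))  # ['cat', 'mat']
```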
Describe learning word representations during training.
- Modelling similarity between words in corpora
- Instead of counting co-occurrences, predict words in context
- Computationally efficient
- Possible to add new words to the model: it scales with corpus size
What is the difference between a Continuous Bag of Words (CBoW) model and a Skip-gram model?
CBOW: Predicts target word w_t from context words.
Skip-gram: Predicts context words from target word w_t
What is the goal of the Skip-gram model? Give its formula and define its probability.
The goal of the skip-gram model is to find word representations that are useful for predicting context words w_t+j given a word w_t by maximising the average log probability for a context window c.
(1/T) * Sum_t Sum_{-c<=j<=c, j != 0} log p(w_t+j | w_t ; θ)
The probability p is a softmax: p(w_{t+j} | w_t) = exp(v'_{w_{t+j}} · v_{w_t}) / Sum_{w=1..W} exp(v'_w · v_{w_t}), i.e. the numerator is the exp() of the dot product between the context word's vector and the target word's vector, and the denominator is the sum of the same quantity over all W words in the vocabulary.
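A small numpy sketch of this softmax, with randomly initialised toy vectors (the vocabulary size, dimension and word ids are all made up for illustration):

```python
import numpy as np

# Sketch of the skip-gram softmax p(context | target) with toy vectors.
rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
target_vectors = rng.normal(size=(vocab_size, dim))   # "input" vectors v_w
context_vectors = rng.normal(size=(vocab_size, dim))  # "output" vectors v'_w

def p_context_given_target(context_id, target_id):
    scores = context_vectors @ target_vectors[target_id]  # dot products with every word
    numerator = np.exp(scores[context_id])                # exp(v'_context . v_target)
    denominator = np.exp(scores).sum()                    # sum over the whole vocabulary
    return numerator / denominator

print(p_context_given_target(context_id=3, target_id=7))
```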
How can we get a matrix of word embeddings of our input?
We set up a neural network with one hidden layer (with no activation function) and one softmax output layer. We train it on one-hot encodings of our words to predict the probabilities of their context words at the output. Once trained, we discard the output layer; the weight matrix between the input and hidden layers is our matrix of word embeddings, with one row per word in the vocabulary.
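A numpy sketch of this setup with illustrative sizes and untrained random weights; only the forward pass is shown, and W_in plays the role of the embedding matrix:

```python
import numpy as np

# One-hot input -> linear hidden layer -> softmax output (skip-gram style).
rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4
W_in = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))   # input-to-hidden weights
W_out = rng.normal(scale=0.1, size=(embedding_dim, vocab_size))  # hidden-to-output weights

def forward(target_id):
    one_hot = np.zeros(vocab_size)
    one_hot[target_id] = 1.0
    hidden = one_hot @ W_in             # no activation: just selects row `target_id` of W_in
    scores = hidden @ W_out
    return np.exp(scores) / np.exp(scores).sum()  # softmax over context-word probabilities

probs = forward(target_id=2)

# After training, the output layer is discarded; W_in is the embedding matrix,
# with row i holding the embedding of word i.
embedding_of_word_2 = W_in[2]
```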
What is negative sampling?
Training a skip-gram neural network has a tremendous number of weights, all of which would be updated slightly by every one of our billions of training samples.
Negative sampling addresses this by having each training sample only modify a small percentage of the weights, rather than all of them.
Instead of updating the output weights for every word in the vocabulary, we randomly choose a small number of negative sample words (unrelated to the target), and only update the weights for those negatives and for the positive sample (the actual context word).
5-20 words works well for smaller datasets, and you can get away with only 2-5 words for large datasets.
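A rough numpy sketch of a single negative-sampling update; the sizes, learning rate and sampled word ids are illustrative, and the gradients follow the standard sigmoid loss used for skip-gram with negative sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, lr = 10, 4, 0.05
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # target-word vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(target_id, context_id, num_negatives=5):
    v = W_in[target_id]
    # One positive sample plus a handful of randomly drawn negatives.
    samples = [(context_id, 1.0)] + [(w, 0.0) for w in rng.integers(0, vocab_size, num_negatives)]
    grad_v = np.zeros(dim)
    for word_id, label in samples:
        u = W_out[word_id]
        error = sigmoid(v @ u) - label    # prediction error for this sample
        grad_v += error * u
        W_out[word_id] -= lr * error * v  # only these few output rows are updated
    W_in[target_id] -= lr * grad_v        # plus the target word's input row

train_pair(target_id=2, context_id=5)
```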
What is Glove (Global Vectors for Word Representation)?
GloVe is an unsupervised learning algorithm that produces word vectors from global word co-occurrence statistics. Pre-trained GloVe vectors are freely available and can be plugged into our deep learning models as embeddings.
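A small sketch of loading such pre-trained vectors from a local GloVe text file; the filename glove.6B.100d.txt is an assumption, and the file must be downloaded separately from the Stanford NLP project page:

```python
import numpy as np

# Each line of a GloVe file is: word followed by its space-separated vector values.
def load_glove(path):
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings

glove = load_glove("glove.6B.100d.txt")  # assumed local copy of the pre-trained vectors
print(glove["cat"].shape)                # (100,)
```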