Word2Vec Flashcards
Is Word2Vec a dense or sparse embedding?
Dense
Are dense or sparse embeddings better?
Dense embeddings tend to work better in practice; the reasons are not fully understood, but dense vectors may generalise better and capture similarity between words more directly than sparse ones
What does static embedding mean?
It means each word is mapped to a single, fixed embedding vector that does not change with context
How does contextual embedding differ from static embedding?
The vector for the word will vary depending on its context
What does self-supervision mean?
It means the training signal comes from the running text itself (which words occur near which), so a large hand-labelled dataset is not needed
What does Word2Vec do in simple terms?
It trains a binary classifier to predict whether word A is likely to appear near word B; the learned classifier weights are then used as the embeddings
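One common way to write this classifier (the skip-gram formulation described in the following cards) scores a candidate target/context pair with the sigmoid of the dot product of the two embeddings; the exact notation here is an assumption, not taken from the card itself:

$$ P(+\mid w,c) = \sigma(\mathbf{c}\cdot\mathbf{w}) = \frac{1}{1 + e^{-\mathbf{c}\cdot\mathbf{w}}}, \qquad P(-\mid w,c) = 1 - \sigma(\mathbf{c}\cdot\mathbf{w}) $$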
What model is used by Word2Vec?
The skip-gram model (with negative sampling, SGNS)
What is the skip-gram model?
It pairs each target word with its neighbouring context words (positive examples), samples random words to create negative examples, trains a logistic-regression-style classifier to distinguish the two, and uses the learned weights as the embeddings
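A small sketch of how these training examples could be generated (function and variable names are illustrative; noise words are drawn uniformly here for brevity, whereas Word2Vec uses the weighted unigram sampling covered two cards below):

```python
import random

def skipgram_examples(tokens, window=2, k=2):
    """Build (target, context, label) triples: label 1 for real neighbours
    inside the window, label 0 for k randomly sampled noise words."""
    vocab = list(set(tokens))
    examples = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            examples.append((target, tokens[j], 1))        # positive example
            for _ in range(k):                             # k negative examples per positive
                examples.append((target, random.choice(vocab), 0))
    return examples

print(skipgram_examples("the quick brown fox jumps".split(), window=2, k=1))
```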
What does a skip-gram model store?
A target embedding matrix W for target words and a context embedding matrix C for context and noise words
How do we avoid a bias towards common words when selecting noise words?
We sample the noise words from a weighted unigram frequency distribution, which dampens the advantage of very frequent words
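A sketch of that weighted sampling distribution; the exponent α = 0.75 is the value commonly reported for Word2Vec rather than something stated on this card:

```python
import numpy as np

def noise_distribution(counts, alpha=0.75):
    """P_alpha(w) = count(w)**alpha / sum over w' of count(w')**alpha.
    Raising counts to alpha < 1 flattens the distribution, so very frequent
    words are picked a little less often and rare words a little more."""
    words = list(counts)
    freqs = np.array([counts[w] for w in words], dtype=float) ** alpha
    return dict(zip(words, freqs / freqs.sum()))

counts = {"the": 1000, "quick": 50, "axolotl": 2}
print(noise_distribution(counts))   # "the" still dominates, but less extremely than raw counts
```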
How are Word2Vec embeddings learnt?
They are learnt by minimising a loss function using stochastic gradient descent (SGD)
What does the loss function do in Word2Vec?
It maximises the probability that the positive (real context) words are classified as neighbours of the target word, and maximises the probability that the negative (noise) words are classified as non-neighbours
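One standard way to write this loss for a single target word w, one positive context word c_pos and k sampled noise words c_neg_1 … c_neg_k (the formula is not given on the card, so take the notation as an assumption):

$$ L_{CE} = -\Big[\log \sigma(\mathbf{c}_{pos}\cdot\mathbf{w}) \;+\; \sum_{i=1}^{k}\log \sigma(-\mathbf{c}_{neg_i}\cdot\mathbf{w})\Big] $$

Minimising L_CE pushes σ(c_pos · w) towards 1 and σ(c_neg_i · w) towards 0, which is exactly the behaviour described above.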
Visually, what are we trying to do with Word2Vec?
We are adjusting the embeddings so that the association (similarity) between the target word and the positive examples increases, and the association between the target word and the negative examples decreases
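As a concrete sketch of that weight movement, here is what one stochastic gradient descent step on the loss above might look like in numpy (a minimal illustration assuming integer word indices into W and C; names such as sgd_step are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(W, C, t, c_pos, c_negs, lr=0.025):
    """One SGD step for a single target word t, its positive context word
    c_pos, and the sampled noise words c_negs (all given as row indices)."""
    w = W[t].copy()                      # use the old target vector for all gradients
    # Positive example: pull the real context word and the target together
    g_pos = sigmoid(C[c_pos] @ w) - 1.0
    grad_w = g_pos * C[c_pos]
    C[c_pos] -= lr * g_pos * w
    # Negative examples: push the sampled noise words and the target apart
    for n in c_negs:
        g_neg = sigmoid(C[n] @ w)
        grad_w += g_neg * C[n]
        C[n] -= lr * g_neg * w
    W[t] -= lr * grad_w

# Tiny usage example with random matrices
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 3))
C = rng.normal(scale=0.1, size=(5, 3))
sgd_step(W, C, t=0, c_pos=1, c_negs=[2, 4])
```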
How are the target matrix and context matrix initialised?
They are randomly initialised, typically with Gaussian noise
How is the final word embedding matrix obtained?
By adding the target matrix (W) and the context matrix (C): W + C (using W alone is a common alternative)
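A minimal sketch tying the last two cards together (matrix sizes and the scale of the Gaussian noise are illustrative assumptions):

```python
import numpy as np

vocab_size, dim = 10_000, 100
rng = np.random.default_rng(0)

# Random initialisation of the two matrices the skip-gram model stores
W = rng.normal(scale=0.1, size=(vocab_size, dim))   # target embeddings
C = rng.normal(scale=0.1, size=(vocab_size, dim))   # context embeddings

# ... W and C are then trained with SGD on the negative-sampling loss ...

embeddings = W + C   # final word embedding matrix (using W alone is a common alternative)
```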