Word2Vec Flashcards
Is Word2Vec a dense or sparse embedding?
Dense
Are dense or sparse embeddings better?
Dense embeddings tend to work better in practice; exactly why is not fully understood, but they give classifiers far fewer weights to learn and can capture similarity between words (e.g. synonyms) that sparse count vectors treat as unrelated
What does static embedding mean?
It means each word has a single fixed embedding that does not change with context
How does contextual embedding differ from static embedding?
The vector for the word will vary depending on its context
What does self-supervision mean?
It means a large hand-tagged dataset is not needed; the model creates its own supervision signal from running text, using a word's neighbours as the labels
What does Word2Vec do in simple terms?
It trains a binary classifier to predict whether word A is likely to appear near word B; we then use the learned classifier weights as the embeddings
What model is used by Word2Vec?
The skip-gram model (specifically, skip-gram with negative sampling, SGNS)
What is the skip-gram model?
It takes each target word and its neighbouring context words as positive examples, randomly samples other words to create negative examples, trains a logistic regression classifier to distinguish the two, and uses the learned weights as the embeddings
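A minimal sketch of how these training pairs could be generated, assuming the corpus is a plain list of tokens; the function name skipgram_pairs is illustrative, and for brevity the negatives are drawn uniformly rather than from the weighted unigram distribution described below.

```python
import random

def skipgram_pairs(tokens, window=2, k=2, vocab=None, rng=random):
    """Generate (target, context, label) triples: label 1 for words seen in
    the context window (positive examples), 0 for sampled noise words."""
    vocab = vocab or sorted(set(tokens))
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            pairs.append((target, tokens[j], 1))        # positive example
            for _ in range(k):                          # k negative examples
                pairs.append((target, rng.choice(vocab), 0))
    return pairs

print(skipgram_pairs("the quick brown fox jumps".split(), window=2, k=1))
```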
What does a skip-gram model store?
A target embedding matrix W for target words and a context embedding matrix C for context and noise words
How do we avoid a bias towards common words when selecting noise words?
We sample noise words according to their weighted unigram frequency: each count is raised to a power α (typically 0.75) and renormalised, which dampens the advantage of very frequent words
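A small sketch of this weighted sampling, assuming word counts are already collected in a dict; noise_distribution and the toy counts are illustrative.

```python
import numpy as np

def noise_distribution(counts, alpha=0.75):
    """P(w) proportional to count(w)**alpha: raising counts to 0.75 dampens
    very frequent words and gives rare words a relatively larger share."""
    words = list(counts)
    weights = np.array([counts[w] for w in words], dtype=float) ** alpha
    return words, weights / weights.sum()

counts = {"the": 1000, "quick": 50, "aardvark": 2}
words, probs = noise_distribution(counts)
print(dict(zip(words, probs.round(3))))
print(np.random.choice(words, size=5, p=probs))   # draw 5 noise words
```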
How are Word2Vec embeddings learnt?
They are learnt by minimising a loss function using stochastic gradient descent (SGD)
What does the loss function do in Word2Vec?
It aims to maximise the similarity between the target word and the positive context examples, and minimise the similarity between the target word and the negative (noise) examples
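For a single target word w with one positive context c_pos and k sampled negatives, the loss typically takes the following form (notation follows the standard textbook presentation of skip-gram with negative sampling):

```latex
L_{CE} = -\left[ \log \sigma(c_{pos} \cdot w) + \sum_{i=1}^{k} \log \sigma(-c_{neg_i} \cdot w) \right]
```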
Visually, what are we trying to do with Word2Vec?
We are nudging the embedding weights so that the target word's vector moves closer to the vectors of the positive examples (increased association) and further from the vectors of the negative examples (decreased association)
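A minimal sketch of one such SGD update, assuming W and C are NumPy arrays of shape (|V|, d) and t, pos, negs are row indices; sgns_step and the learning rate eta are illustrative, with gradients derived from the loss above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W, C, t, pos, negs, eta=0.05):
    """One SGD step on the negative-sampling loss: pull the target row W[t]
    towards the positive context C[pos] and push it away from each noise
    context C[n], updating the context rows as well."""
    w = W[t].copy()
    g_pos = sigmoid(C[pos] @ w) - 1.0                  # in (-1, 0)
    g_negs = [sigmoid(C[n] @ w) for n in negs]         # each in (0, 1)
    grad_w = g_pos * C[pos] + sum(g * C[n] for g, n in zip(g_negs, negs))
    C[pos] -= eta * g_pos * w                          # increases c_pos . w
    for g, n in zip(g_negs, negs):
        C[n] -= eta * g * w                            # decreases c_neg . w
    W[t] -= eta * grad_w
```

Repeating this small nudge over many (target, positive, negatives) triples drawn from the corpus is what produces the trained matrices.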
How are the target matrix and context matrix initialised?
They are randomly initialised, typically with Gaussian noise
How is the final word embedding matrix obtained?
By adding the target matrix (W) and the context matrix (C), so each word's embedding is w_i + c_i; some implementations simply discard C and use W alone
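If both matrices have shape (|V|, d), forming the final embeddings is a single element-wise addition; a sketch with toy NumPy arrays standing in for the trained W and C:

```python
import numpy as np

V, d = 5, 3                   # toy vocabulary size and embedding dimension
W = np.random.randn(V, d)     # trained target matrix (toy values here)
C = np.random.randn(V, d)     # trained context matrix (toy values here)

embeddings = W + C            # each word's final vector is w_i + c_i
# (some implementations instead discard C and use W on its own)
```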
What are some other static embeddings?
fastText and Global Vectors (GloVe)
What words appear in a context window when it is small?
Words that are similar, such as those within a list
What words appear in a context window when it is larger?
Words that are more associated rather than similar, so it will capture longer distance topical relationships
What do analogies mean in relation to Word2Vec?
They test relations of the form "A is to B as C is to …" (e.g. man is to king as woman is to queen)
How are analogies computed using Word2Vec?
We compute the offset vector that takes you from A's embedding to B's embedding, add that offset to C's embedding, and find which words' embeddings are most similar to the result
When using Word2Vec with analogies, what needs to be excluded?
The input words themselves and their morphological variants (e.g. we don't want potato → potato or potatoes as an answer, but potato → brown would be fine)
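A sketch of this offset-and-search step, assuming emb is a dict mapping words to NumPy vectors; the name analogy is illustrative, and only the three query words themselves are excluded here (a fuller system would also filter their morphological variants).

```python
import numpy as np

def analogy(emb, a, b, c, topn=3):
    """Return the words closest (by cosine similarity) to
    vec(b) - vec(a) + vec(c), i.e. 'a is to b as c is to ?',
    excluding the query words themselves."""
    target = emb[b] - emb[a] + emb[c]
    target /= np.linalg.norm(target)
    scores = {
        word: float(vec @ target / np.linalg.norm(vec))
        for word, vec in emb.items()
        if word not in (a, b, c)
    }
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# e.g. analogy(emb, "man", "king", "woman") would be expected to rank "queen" highly
```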
What are the problems with bias and embeddings?
The biases of the time and of the training text get baked into the embeddings, which is problematic given that the world (and language use) constantly changes
What is allocation harm?
It is where bias in algorithms results in unfair real-world outcomes (e.g. a credit check that denies someone because of some underlying bias)
What is bias amplification?
It is where embeddings exaggerate patterns in the data, making the encodings even more biased than the original training resource; implicit biases (racial, sexist, ageist) can also be captured
What is representational harm?
It is where harm is caused by a system demeaning or ignoring some social groups
What is debiasing?
It is a way of manipulating embeddings to remove unwelcome stereotypes; it may reduce bias, but it will not eliminate it
When do two words have first-order co-occurrence?
When they are typically near each other
When do two words have second-order co-occurrence?
When they have similar neighbours
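A toy sketch contrasting the two notions, assuming cooc is a nested dict of co-occurrence counts whose inner and outer keys are the same vocabulary; all names here are illustrative.

```python
import numpy as np

def first_order(cooc, w1, w2):
    """First-order: how often the two words themselves occur near each other."""
    return cooc[w1][w2]

def second_order(cooc, w1, w2):
    """Second-order: how similar the two words' neighbourhoods are, measured as
    cosine similarity between their rows of co-occurrence counts."""
    vocab = sorted(cooc)
    v1 = np.array([cooc[w1][c] for c in vocab], dtype=float)
    v2 = np.array([cooc[w2][c] for c in vocab], dtype=float)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```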