W5 L2 distributional semantics Flashcards
watch a video on what feedforward neural networks are
yuh
watch a video on binary logistic regression as a one layer network
yuh
watch a video on multinomial logistic regression as a 1-layer network
watch a video on softmax
yuh
watch a video on two layer network with scalar output
what are word embeddings
short vectors, with dimensions ranging from 50 to 1000
they are faster to work with because the vectors are dense
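Not from the lecture, but a tiny numpy sketch of the size difference the card is pointing at; the vocabulary size, embedding dimension, and counts are made up for illustration:

```python
import numpy as np

# Hypothetical sizes: a 50,000-word vocabulary vs. a 300-dimensional embedding.
vocab_size, embed_dim = 50_000, 300

# Sparse count vector: one dimension per vocabulary word, almost all zeros.
count_vector = np.zeros(vocab_size)
count_vector[[10, 523, 4201]] = [3, 1, 7]   # only a few co-occurrence counts are non-zero

# Dense embedding: a short vector with a real value in every dimension.
embedding = np.random.randn(embed_dim)

print(count_vector.shape, embedding.shape)  # (50000,) vs (300,)
```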
what is word2vec
two architectures,
the continuous bag-of-words model
and the continuous skip-gram model,
in a software package called word2vec;
it is used for word embedding
what do you train a classifier to solve in word2vec
you train it to solve the binary task:
is word c likely to show up near apricot?
c is the correct answer to the question
what are we actually interested in in w2v skip-gram negative sampling
not the answer to the question "is word c likely to show up near apricot?" but
the weights the classifier learns in order to answer this question;
these weights are used as the embedding for the target word apricot
what is the procedure for word2vec
In essence, word2vec performs binary classification by training a logistic regression classifier. Here is the procedure that the skip-gram model follows (a rough sketch of building the training pairs is given after this list):
- It treats the target word w and a neighbouring context word c as positive examples.
- Then, it randomly samples other words in the vocabulary and uses them as negative examples.
- Next, it uses logistic regression to train a classifier that distinguishes between positive and negative examples.
- Finally, it uses the learned weights as the word embeddings.
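Not the lecture's own code, but a minimal sketch of how the positive and negative training pairs in this procedure could be generated; the window size, the value of k, and the toy sentence are assumptions:

```python
import random

def make_sgns_pairs(tokens, vocab, window=2, k=2, seed=0):
    """Build (target, context, label) triples: 1 for observed pairs, 0 for sampled noise words."""
    rng = random.Random(seed)
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            pairs.append((target, tokens[j], 1))          # positive: target with a real neighbour
            for _ in range(k):                            # k negative samples per positive pair
                pairs.append((target, rng.choice(vocab), 0))
    return pairs

sentence = "a tablespoon of apricot jam".split()
for pair in make_sgns_pairs(sentence, vocab=sentence)[:6]:
    print(pair)
```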
what is pca
principal component analysis applies a linear transform to the data so that the dimensions with the most variance (the principal components) can be easily identified
how can we find principal components
we can use singular value decomposition to find principal components
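A small numpy sketch of this, using made-up data; the centring step and the choice of keeping two components are assumptions for illustration:

```python
import numpy as np

# Made-up data matrix: 100 points in 5 dimensions.
X = np.random.randn(100, 5)
X_centred = X - X.mean(axis=0)          # centre the data first

# SVD: the rows of Vt are the principal components (directions of greatest variance).
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

top_two_components = Vt[:2]                     # the two directions with the most variance
projected = X_centred @ top_two_components.T    # data expressed in those two dimensions
print(projected.shape)                          # (100, 2)
```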
what is the problem with using singular value decomposition to find principal components
it is computationally expensive
instead of using expensive singular value decomposition, what can we do
we can use word2vec
what is the goal of the classifier in word2vec
given a tuple (w, c) for the target word w paired with a
candidate context word c (e.g., (apricot, jam) and (apricot, aardvark)), return the probability that c is a real
context word (which should return True for jam, and False for aardvark). The probability for the actual
context word c is defined as:
P(+|w,c)
the chance that c is not a real context of the target word is
P(-|w,c) = 1 - P(+|w,c)
how does the word2vec classifier estimate the probability P(+|w,c)
the skip-gram model uses embedding similarity: a context word c is likely to occur near the target word w if its embedding is similar to the target embedding. To compute this similarity, the skip-gram model uses the dot product:
similarity(c, w) = c · w
The higher the dot product, the more similar the two embeddings are (note that cosine is a length normalised dot product).
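A tiny sketch (toy vectors, not real embeddings) contrasting the raw dot product used here with its length-normalised cousin, cosine:

```python
import numpy as np

c = np.array([0.2, -0.5, 1.1])   # toy context-word embedding
w = np.array([0.3, -0.4, 0.9])   # toy target-word embedding

dot = c @ w                                             # raw dot-product similarity
cosine = dot / (np.linalg.norm(c) * np.linalg.norm(w))  # cosine = length-normalised dot product
print(dot, cosine)
```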
What simplifying assumption does the skip-gram model make?
The skip-gram model assumes context words are independent, similar to assumptions in Naïve Bayes and Markov models. This allows the model to multiply individual probabilities.
What does the skip-gram model do?
It trains a probabilistic classifier that, given a target word w and a context window of L words c_1 ... c_L, assigns a probability based on how similar each context word is to the target word.
How does the skip-gram model calculate probabilities?
It applies the logistic (sigmoid) function to the dot product of the target word's embedding and each context word's embedding.
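Putting the last two cards together, a minimal sketch (random toy embeddings, assumed window of four context words) of applying the sigmoid to each dot product and then multiplying over the window under the independence assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
w = rng.standard_normal(50)                # toy target-word embedding
context = rng.standard_normal((4, 50))     # toy embeddings for a window of L = 4 context words

per_word = sigmoid(context @ w)            # P(+|w, c_i) = sigmoid(c_i . w) for each context word
window_prob = per_word.prod()              # independence assumption: multiply the probabilities
print(per_word, window_prob)
```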
What embeddings does the skip-gram model learn?
The model learns:
Target word embeddings.
Context word embeddings.
Each word has two embeddings: one for when it's a target word and another for when it's a context word.
What are the two matrices learned in the skip-gram model?
W: Contains embeddings for target words.
C: Contains embeddings for context words.
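A minimal sketch of these two matrices as numpy arrays; the sizes and the row index for "apricot" are made up:

```python
import numpy as np

vocab_size, d = 10_000, 300                 # assumed vocabulary size and embedding dimension
rng = np.random.default_rng(0)

W = rng.standard_normal((vocab_size, d))    # target-word embeddings, one row per word
C = rng.standard_normal((vocab_size, d))    # context-word embeddings, one row per word

apricot_id = 42                             # hypothetical row index for "apricot"
target_vec  = W[apricot_id]                 # used when "apricot" is the target word
context_vec = C[apricot_id]                 # used when "apricot" appears as a context word
```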
what are the two embeddings word2vec stores per word
one as target word and one as context word
what is the process for learning embeddings
start with a random initialization for the word embeddings
iteratively update the embeddings so that the embedding for a word w gets closer to the embeddings of words that occur nearby, and further away from the embeddings of words that do not co-occur
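A rough sketch of one such update, written as plain SGD on the negative-sampling loss; the learning rate and toy vectors are assumptions, and real implementations add details (such as frequency-weighted negative sampling) not shown here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w, c_pos, c_negs, lr=0.05):
    """One stochastic gradient step on the skip-gram negative-sampling loss (in-place updates)."""
    grad_w = (sigmoid(c_pos @ w) - 1.0) * c_pos           # pull w towards the real context word
    c_pos -= lr * (sigmoid(c_pos @ w) - 1.0) * w
    for c_neg in c_negs:                                  # push w away from each noise word
        grad_w += sigmoid(c_neg @ w) * c_neg
        c_neg -= lr * sigmoid(c_neg @ w) * w
    w -= lr * grad_w

rng = np.random.default_rng(0)
w, c_pos = rng.standard_normal(50), rng.standard_normal(50)
c_negs = rng.standard_normal((2, 50))
sgns_step(w, c_pos, c_negs)   # w moves towards c_pos and away from the sampled noise vectors
```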
what are negative samples in skip gram
Since the model trains a binary classifier, apart from positive instances it also needs some negative ones. The skip-gram model uses negative sampling: it randomly selects a number of negative examples, with the ratio of negative-to-positive examples defined by the parameter k. In other words, for each (w, c_pos) training instance, the algorithm randomly selects k "noise" words from the vocabulary.