W5 L2 distributional semantics Flashcards
watch a video on what feedforward neural networks are
yuh
watch a video on binary logistic regression as a one-layer network
yuh
watch a video on Multinomial Logistic Regression as a 1-layer Network
watch a video on softmax
yuh
watch a video on a two-layer network with scalar output
what are word embeddings
short, dense vectors with dimensions ranging from 50 to 1000
they are faster to work with because the vectors are dense
what is word2vec
two architectures:
the continuous bag-of-words model
and the continuous skip-gram model
in a software package called word2vec
it is used for word embeddings
what do you train a classifier to solve in word2vec
you train it to solve the binary task:
"is word c likely to show up near apricot?"
where c is a candidate context word
what are we actually interested in in w2v skip-gram negative sampling
not the answer to the question "is word c likely to show up near apricot?" but
the weights the classifier learns in order to answer this question;
these weights are used as the embedding for the target word apricot
what is the procedure for word2vec
In essence, word2vec trains a logistic regression classifier to perform binary classification. Here is the procedure that the skip-gram model follows:
- It treats the target word w and a neighbouring context word c as positive examples.
- Then, it randomly samples other words in the vocabulary and uses them as negative examples.
- Next, it uses logistic regression to train a classifier aimed at distinguishing between positive and negative examples.
- Finally, it uses the learned weights as the word embeddings.
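A minimal Python sketch of the data-collection step of this procedure, under my own simplifications (uniform negative sampling and a made-up function name; word2vec itself draws negatives from a weighted unigram distribution):

```python
import random

def skipgram_training_examples(tokens, window=2, k=2, seed=0):
    """Sketch: build (target, context, label) triples for skip-gram with
    negative sampling. Positives come from a +/- `window` context; for each
    positive, k negatives are drawn (here uniformly) from the vocabulary."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    examples = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            examples.append((w, tokens[j], 1))            # positive example
            for _ in range(k):                            # k negatives per positive
                examples.append((w, rng.choice(vocab), 0))
    return examples

print(skipgram_training_examples("a tablespoon of apricot jam".split()))
```

A logistic regression classifier would then be trained on these labelled pairs, and its learned weights kept as the embeddings.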
what is pca
principal component analysis applies a linear transform to the data so that the dimensions with the most variance (the principal components) can be easily identified
how can we find principal components
we can use singular value decomposition to find principal components
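For reference, a small numpy sketch of PCA via SVD on toy data (the rows of Vt are the principal components, ordered by the variance they explain):

```python
import numpy as np

# Sketch: PCA via singular value decomposition on toy data.
X = np.random.randn(100, 10)              # 100 points in 10 dimensions
Xc = X - X.mean(axis=0)                   # centre each dimension
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]                       # top-2 principal components
projected = Xc @ components.T             # data projected onto them
```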
what is the problem with using singular value decomposition to find principal components
it is computationally expensive
instead of using expensive singular value decomposition what can we do
we can use word2vec
what is the goal of the classifier in word2vec
given a tuple (w, c) for the target word w paired with a candidate context word c (e.g., (apricot, jam) and (apricot, aardvark)), return the probability that c is a real context word (which should be high for jam and low for aardvark). The probability that c is an actual context word is defined as:
P(+|w,c)
the probability that c is not a real context word of the target word is
P(-|w,c) = 1 - P(+|w,c)
how does the word2vec classifier estimate the probability P(+|w,c)
the skip-gram model uses embedding similarity: a context word c is likely to occur near the target word w if its embedding is similar to the target embedding. To compute this similarity, the skip-gram model uses the dot product:
similarity(c, w) = c · w
The higher the dot product, the more similar the two embeddings are (note that cosine is a length-normalised dot product).
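A small sketch of the classifier's probability estimate, assuming toy embedding vectors (the function name and the numbers are mine):

```python
import numpy as np

def p_positive(c, w):
    """Sketch: P(+|w,c) = sigmoid(c . w), the probability that c is a real
    context word of target w, computed from the embedding dot product."""
    return 1.0 / (1.0 + np.exp(-np.dot(c, w)))

w = np.array([0.2, -0.1, 0.7])   # toy target embedding
c = np.array([0.3,  0.0, 0.9])   # toy context embedding
print(p_positive(c, w))          # P(+|w,c): higher dot product -> closer to 1
print(1 - p_positive(c, w))      # P(-|w,c)
```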
What simplifying assumption does the skip-gram model make?
The skip-gram model assumes context words are independent, similar to assumptions in Naïve Bayes and Markov models. This allows the model to multiply individual probabilities.
What does the skip-gram model do?
It trains a probabilistic classifier that, given a target word w and a context window of L words c_1:L, assigns a probability based on how similar each context word is to the target word.
How does the skip-gram model calculate probabilities?
It applies the logistic (sigmoid) function to the dot product of the target word’s embedding and each context word’s embedding.
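Written out in the standard skip-gram notation (bold vectors are embeddings; the window case relies on the independence assumption above):

```latex
P(+\mid w,c) = \sigma(\mathbf{c}\cdot\mathbf{w}) = \frac{1}{1+\exp(-\mathbf{c}\cdot\mathbf{w})},
\qquad
P(-\mid w,c) = 1 - \sigma(\mathbf{c}\cdot\mathbf{w}) = \sigma(-\mathbf{c}\cdot\mathbf{w}),
\qquad
P(+\mid w, c_{1:L}) = \prod_{i=1}^{L} \sigma(\mathbf{c}_i \cdot \mathbf{w})
```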
What embeddings does the skip-gram model learn?
The model learns:
Target word embeddings.
Context word embeddings.
Each word has two embeddings: one for when it’s a target word and another for when it’s a context word.
What are the two matrices learned in the skip-gram model?
W: Contains embeddings for target words.
C: Contains embeddings for context words.
what are the two embeddings word2vec stores per word
one as a target word and one as a context word
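A sketch of those two matrices with illustrative sizes (the vocabulary size, dimension, and initialization scale are placeholders of mine):

```python
import numpy as np

V, d = 10000, 300                        # vocabulary size, embedding dimension (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, d))   # target-word embeddings, one row per word
C = rng.normal(scale=0.1, size=(V, d))   # context-word embeddings, one row per word

# Word i has two embeddings: W[i] when it acts as the target and C[i] when it
# acts as a context word; after training, W[i] (or W[i] + C[i]) is commonly
# kept as the final embedding.
```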
what is the process for learning embeddings
start with a random initialization for the word embeddings
iteratively update the embeddings so that the embedding for the word w gets closer to the embeddings of words that occur nearby and further away from the embeddings of words that do not co-occur
what are negative samples in skip gram
Since the model trains a binary classifier, apart from positive instances it also needs some negative instances. The skip-gram model uses negative sampling, which means that it randomly selects a number of negative examples, with the ratio of negative-to-positive examples defined by the parameter k: in other words, for each (w, c_pos) training instance, the algorithm randomly selects k "noise" words from the vocabulary.
What is the goal of the skip-gram algorithm?
To adjust embeddings so that:
Similarity of (w, c_pos) pairs (target and real context words) is maximized.
Similarity of (w, c_neg) pairs (target and negative/noise words) is minimized.
What does the skip-gram loss function (L_CE) represent?
The loss function has two terms:
First term: maximizes the probability of the real context word c_pos being a neighbor of w.
Second term: minimizes the probability of the negative words c_neg being neighbors of w.
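Spelled out, the standard skip-gram negative-sampling loss for one positive pair (w, c_pos) with k sampled negatives c_neg_1 .. c_neg_k is:

```latex
L_{CE} = -\log\Big[ P(+\mid w, c_{pos}) \prod_{i=1}^{k} P(-\mid w, c_{neg_i}) \Big]
       = -\Big[ \log \sigma(\mathbf{c}_{pos}\cdot\mathbf{w}) + \sum_{i=1}^{k} \log \sigma(-\mathbf{c}_{neg_i}\cdot\mathbf{w}) \Big]
```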
Why are the loss function terms multiplied?
The skip-gram model assumes word independence, so probabilities of context words can be multiplied
How does the skip-gram model evaluate similarity?
By the dot product of the target word embedding w with:
c_pos (real context words), to maximize similarity.
c_neg (negative samples), to minimize similarity.
How is the skip-gram loss function minimized?
Using stochastic gradient descent (SGD), which is commonly used for optimizing models like logistic regression and neural networks.
What does the skip-gram learning process aim to achieve in one step?
It adjusts embeddings to bring the target word w closer to its true neighbors c_pos and farther from noise words c_neg.
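A sketch of one such SGD step, using the standard gradients of the loss above (variable names and the learning rate are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_sgd_step(w, c_pos, c_negs, lr=0.1):
    """Sketch: one stochastic gradient descent update for a single
    (w, c_pos, {c_neg}) training instance of skip-gram with negative sampling."""
    g_pos = (sigmoid(c_pos @ w) - 1.0) * w                 # gradient w.r.t. c_pos
    g_negs = [sigmoid(c @ w) * w for c in c_negs]          # gradients w.r.t. each c_neg
    g_w = (sigmoid(c_pos @ w) - 1.0) * c_pos \
          + sum(sigmoid(c @ w) * c for c in c_negs)        # gradient w.r.t. w
    # Step against the gradients: w moves toward c_pos and away from each c_neg.
    return (w - lr * g_w,
            c_pos - lr * g_pos,
            [c - lr * g for c, g in zip(c_negs, g_negs)])
```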
What is a core assumption in building word embeddings?
Word embeddings can capture semantic similarity between words, making it essential to evaluate them on this ability.
How is semantic similarity tested without context?
By comparing word similarity scores (e.g., cosine similarity of word vectors) with human-assigned ratings using datasets like:
WordSim-353: 353 noun pairs with similarity ratings (e.g., “plane, car” = 5.77).
SimLex-999: Adjective, noun, and verb pairs with ratings.
TOEFL dataset: 80 questions with a target word and 4 choices, one of which is synonymous with the target.
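A sketch of this kind of evaluation: compute cosine similarities for word pairs and rank-correlate them with human ratings (the embeddings and all ratings except the plane/car example are made up):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings and WordSim-353-style human ratings.
emb = {"plane": np.array([0.9, 0.1, 0.2]), "car":   np.array([0.8, 0.3, 0.1]),
       "cup":   np.array([0.1, 0.9, 0.4]), "coast": np.array([0.2, 0.1, 0.9])}
pairs = [("plane", "car"), ("cup", "coast"), ("car", "coast")]
human = [5.77, 2.0, 1.5]                            # human similarity ratings
model = [cosine(emb[a], emb[b]) for a, b in pairs]  # model similarity scores
rho, _ = spearmanr(human, model)                    # rank correlation with humans
print(rho)
```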
How is semantic similarity tested in context?
Using datasets like:
SCWS: Stanford Contextual Word Similarity.
WiC: Word-in-Context dataset.
These test models on human judgments of word use in sentential contexts.
What is the semantic textual similarity task?
It focuses on sentence-level similarity, where pairs of sentences are labeled with human-assigned similarity scores.
How is compositionality tested in semantic representations?
By evaluating their ability to perform paraphrasing tasks, such as:
Fantasy world → Fairyland.
Dog house → Kennel.
Fish bowl → Aquarium.
How is relational meaning tested in semantic models?
Using analogy tasks like:
“Apple is to tree as grape is to ___ (vine).”
Solved with vector arithmetic:
b* = a* − a + b
where the closest vector to the result is selected.
How does Word2Vec solve analogy tasks?
By using vector arithmetic, such as:
“King - Man + Woman” → Queen.
“Queen - King + Kings” → Queens.
These demonstrate linguistic regularities in word embeddings.
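A sketch of the vector-arithmetic approach with toy embeddings (the function name and vectors are mine; real embeddings would come from training):

```python
import numpy as np

def solve_analogy(a, a_star, b, emb):
    """Sketch: 'a is to a_star as b is to ?' -- compute a_star - a + b and
    return the nearest vocabulary word by cosine similarity, excluding the
    three input words."""
    target = emb[a_star] - emb[a] + emb[b]
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {x: cos(v, target) for x, v in emb.items() if x not in {a, a_star, b}}
    return max(candidates, key=candidates.get)

emb = {"king":  np.array([0.9, 0.8, 0.1]),
       "man":   np.array([0.9, 0.1, 0.1]),
       "woman": np.array([0.1, 0.1, 0.9]),
       "queen": np.array([0.1, 0.8, 0.9]),
       "apple": np.array([0.5, 0.5, 0.5])}
print(solve_analogy("man", "king", "woman", emb))   # expected: queen
```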