W5 L2 distributional semantics Flashcards
watch a video on what feedforward neural networks are
yuh
watch a video on binary logistic regression as a one-layer network
yuh
watch a video on Multinomial Logistic Regression as a 1-layer Network
watch a video on softmax
yuh
watch a video on a two-layer network with scalar output
what are word embeddings
short, dense vectors with dimensions ranging from 50 to 1000
they are faster to work with because the vectors are dense
what is word2vec
two architectures:
the continuous bag-of-words model
and the continuous skip-gram model
in a software package called word2vec
it is used for word embeddings
what do you train a classifier to solve in word2vec
you train it to solve the binary task:
"is word c likely to show up near apricot?"
where c is a candidate context word
what are we actually interested in in w2v skip-gram negative sampling
not the answer to the question "is word c likely to show up near apricot?" but
the weights the classifier learns in order to answer this question;
these weights are used as the embedding for the target word apricot
what is the procedure for word2vec
In essence, word2vec trains a logistic regression classifier to perform binary classification. Here is the procedure that the skip-gram model follows:
- It treats the target word w and a neighbouring context word c as positive examples.
- Then, it randomly samples other words in the vocabulary and uses them as negative examples.
- Next, it uses logistic regression to train a classifier aimed at distinguishing between positive and negative examples.
- Finally, it uses the learned weights as the word embeddings.
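A minimal Python sketch of the data-collection step of this procedure, under my own simplifications (uniform negative sampling and a made-up function name; word2vec itself draws negatives from a weighted unigram distribution):

```python
import random

def skipgram_training_examples(tokens, window=2, k=2, seed=0):
    """Sketch: build (target, context, label) triples for skip-gram with
    negative sampling. Positives come from a +/- `window` context; for each
    positive, k negatives are drawn (here uniformly) from the vocabulary."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    examples = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            examples.append((w, tokens[j], 1))            # positive example
            for _ in range(k):                            # k negatives per positive
                examples.append((w, rng.choice(vocab), 0))
    return examples

print(skipgram_training_examples("a tablespoon of apricot jam".split()))
```

A logistic regression classifier would then be trained on these labelled pairs, and its learned weights kept as the embeddings.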
what is pca
principal component analysis applies a linear transform to the data so that the dimensions with the most variance (the principal components) can be easily identified
how can we find principal components
we can use singular value decomposition to find principal components
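For reference, a small numpy sketch of PCA via SVD on toy data (the rows of Vt are the principal components, ordered by the variance they explain):

```python
import numpy as np

# Sketch: PCA via singular value decomposition on toy data.
X = np.random.randn(100, 10)              # 100 points in 10 dimensions
Xc = X - X.mean(axis=0)                   # centre each dimension
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]                       # top-2 principal components
projected = Xc @ components.T             # data projected onto them
```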
what is the problem with using singular value decomposition to find principal components
it is computationally expensive
instead of using expensive singular value decomposition what can we do
we can use word2vec
what is the goal of the classifier in word2vec
given a tuple (w, c) for the target word w paired with a candidate context word c (e.g., (apricot, jam) and (apricot, aardvark)), return the probability that c is a real context word (which should be high for jam and low for aardvark). The probability that c is an actual context word is defined as:
P(+|w,c)
the probability that c is not a real context word of the target word is
P(-|w,c) = 1 - P(+|w,c)
how does the word2vec classifier estimate the probability P(+|w,c)
the skip-gram model uses embedding similarity: a context word c is likely to occur near the target word w if its embedding is similar to the target embedding. To compute this similarity, the skip-gram model uses the dot product:
similarity(c, w) = c · w
The higher the dot product, the more similar the two embeddings are (note that cosine is a length-normalised dot product).
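A small sketch of the classifier's probability estimate, assuming toy embedding vectors (the function name and the numbers are mine):

```python
import numpy as np

def p_positive(c, w):
    """Sketch: P(+|w,c) = sigmoid(c . w), the probability that c is a real
    context word of target w, computed from the embedding dot product."""
    return 1.0 / (1.0 + np.exp(-np.dot(c, w)))

w = np.array([0.2, -0.1, 0.7])   # toy target embedding
c = np.array([0.3,  0.0, 0.9])   # toy context embedding
print(p_positive(c, w))          # P(+|w,c): higher dot product -> closer to 1
print(1 - p_positive(c, w))      # P(-|w,c)
```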
What simplifying assumption does the skip-gram model make?
The skip-gram model assumes context words are independent, similar to assumptions in Naïve Bayes and Markov models. This allows the model to multiply individual probabilities.
What does the skip-gram model do?
It trains a probabilistic classifier that, given a target word w and a context window of L words c_1:L, assigns a probability based on how similar each context word is to the target word.
How does the skip-gram model calculate probabilities?
It applies the logistic (sigmoid) function to the dot product of the target word’s embedding and each context word’s embedding.
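Written out in the standard skip-gram notation (bold vectors are embeddings; the window case relies on the independence assumption above):

```latex
P(+\mid w,c) = \sigma(\mathbf{c}\cdot\mathbf{w}) = \frac{1}{1+\exp(-\mathbf{c}\cdot\mathbf{w})},
\qquad
P(-\mid w,c) = 1 - \sigma(\mathbf{c}\cdot\mathbf{w}) = \sigma(-\mathbf{c}\cdot\mathbf{w}),
\qquad
P(+\mid w, c_{1:L}) = \prod_{i=1}^{L} \sigma(\mathbf{c}_i \cdot \mathbf{w})
```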
What embeddings does the skip-gram model learn?
The model learns:
Target word embeddings.
Context word embeddings.
Each word has two embeddings: one for when it’s a target word and another for when it’s a context word.
What are the two matrices learned in the skip-gram model?
W: Contains embeddings for target words.
C: Contains embeddings for context words.
what are the two embeddings word2vec stores per word
one as a target word and one as a context word
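A sketch of those two matrices with illustrative sizes (the vocabulary size, dimension, and initialization scale are placeholders of mine):

```python
import numpy as np

V, d = 10000, 300                        # vocabulary size, embedding dimension (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, d))   # target-word embeddings, one row per word
C = rng.normal(scale=0.1, size=(V, d))   # context-word embeddings, one row per word

# Word i has two embeddings: W[i] when it acts as the target and C[i] when it
# acts as a context word; after training, W[i] (or W[i] + C[i]) is commonly
# kept as the final embedding.
```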
what is the process for learning embeddings
start with a random initialization for the word embeddings
iteratively update the embeddings so that the embedding for the word w gets closer to the embeddings of words that occur nearby and further away from the embeddings of words that do not co-occur
what are negative samples in skip gram
Since the model trains a binary classifier, apart from positive instances it also needs some negative instances. The skip-gram model uses negative sampling, which means that it randomly selects a number of negative examples, with the ratio of negative-to-positive examples defined by the parameter k: in other words, for each (w, c_pos) training instance, the algorithm randomly selects k "noise" words from the vocabulary.
What is the goal of the skip-gram algorithm?
To adjust embeddings so that:
Similarity of (w, c_pos) pairs (target and real context words) is maximized.
Similarity of (w, c_neg) pairs (target and negative/noise words) is minimized.
What does the skip-gram loss function (L_CE) represent?
The loss function has two terms:
First term: maximizes the probability of the real context word c_pos being a neighbor of w.
Second term: minimizes the probability of the negative words c_neg being neighbors of w.
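Spelled out, the standard skip-gram negative-sampling loss for one positive pair (w, c_pos) with k sampled negatives c_neg_1 .. c_neg_k is:

```latex
L_{CE} = -\log\Big[ P(+\mid w, c_{pos}) \prod_{i=1}^{k} P(-\mid w, c_{neg_i}) \Big]
       = -\Big[ \log \sigma(\mathbf{c}_{pos}\cdot\mathbf{w}) + \sum_{i=1}^{k} \log \sigma(-\mathbf{c}_{neg_i}\cdot\mathbf{w}) \Big]
```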
Why are the loss function terms multiplied?
The skip-gram model assumes word independence, so probabilities of context words can be multiplied
How does the skip-gram model evaluate similarity?
By the dot product of the target word embedding w with:
c_pos (real context words), to maximize similarity.
c_neg (negative samples), to minimize similarity.
How is the skip-gram loss function minimized?
Using stochastic gradient descent (SGD), which is commonly used for optimizing models like logistic regression and neural networks.
What does the skip-gram learning process aim to achieve in one step?
It adjusts embeddings to bring the target word w closer to its true neighbors c_pos and farther from noise words c_neg.
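A sketch of one such SGD step, using the standard gradients of the loss above (variable names and the learning rate are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_sgd_step(w, c_pos, c_negs, lr=0.1):
    """Sketch: one stochastic gradient descent update for a single
    (w, c_pos, {c_neg}) training instance of skip-gram with negative sampling."""
    g_pos = (sigmoid(c_pos @ w) - 1.0) * w                 # gradient w.r.t. c_pos
    g_negs = [sigmoid(c @ w) * w for c in c_negs]          # gradients w.r.t. each c_neg
    g_w = (sigmoid(c_pos @ w) - 1.0) * c_pos \
          + sum(sigmoid(c @ w) * c for c in c_negs)        # gradient w.r.t. w
    # Step against the gradients: w moves toward c_pos and away from each c_neg.
    return (w - lr * g_w,
            c_pos - lr * g_pos,
            [c - lr * g for c, g in zip(c_negs, g_negs)])
```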
What is a core assumption in building word embeddings?
Word embeddings can capture semantic similarity between words, making it essential to evaluate them on this ability.
How is semantic similarity tested without context?
By comparing word similarity scores (e.g., cosine similarity of word vectors) with human-assigned ratings using datasets like:
WordSim-353: 353 noun pairs with similarity ratings (e.g., “plane, car” = 5.77).
SimLex-999: Adjective, noun, and verb pairs with ratings.
TOEFL dataset: 80 questions with a target word and 4 choices, one of which is synonymous with the target.
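A sketch of this kind of evaluation: compute cosine similarities for word pairs and rank-correlate them with human ratings (the embeddings and all ratings except the plane/car example are made up):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings and WordSim-353-style human ratings.
emb = {"plane": np.array([0.9, 0.1, 0.2]), "car":   np.array([0.8, 0.3, 0.1]),
       "cup":   np.array([0.1, 0.9, 0.4]), "coast": np.array([0.2, 0.1, 0.9])}
pairs = [("plane", "car"), ("cup", "coast"), ("car", "coast")]
human = [5.77, 2.0, 1.5]                            # human similarity ratings
model = [cosine(emb[a], emb[b]) for a, b in pairs]  # model similarity scores
rho, _ = spearmanr(human, model)                    # rank correlation with humans
print(rho)
```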
How is semantic similarity tested in context?
Using datasets like:
SCWS: Stanford Contextual Word Similarity.
WiC: Word-in-Context dataset.
These test models on human judgments of word use in sentential contexts.
What is the semantic textual similarity task?
It focuses on sentence-level similarity, where pairs of sentences are labeled with human-assigned similarity scores.
How is compositionality tested in semantic representations?
By evaluating their ability to perform paraphrasing tasks, such as:
Fantasy world → Fairyland.
Dog house → Kennel.
Fish bowl → Aquarium.
How is relational meaning tested in semantic models?
Using analogy tasks like:
“Apple is to tree as grape is to ___ (vine).”
Solved with vector arithmetic:
b* = a* − a + b
where the closest vector to the result is selected.
How does Word2Vec solve analogy tasks?
By using vector arithmetic, such as:
“King - Man + Woman” → Queen.
“Queen - King + Kings” → Queens.
These demonstrate linguistic regularities in word embeddings.
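A sketch of the vector-arithmetic approach with toy embeddings (the function name and vectors are mine; real embeddings would come from training):

```python
import numpy as np

def solve_analogy(a, a_star, b, emb):
    """Sketch: 'a is to a_star as b is to ?' -- compute a_star - a + b and
    return the nearest vocabulary word by cosine similarity, excluding the
    three input words."""
    target = emb[a_star] - emb[a] + emb[b]
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {x: cos(v, target) for x, v in emb.items() if x not in {a, a_star, b}}
    return max(candidates, key=candidates.get)

emb = {"king":  np.array([0.9, 0.8, 0.1]),
       "man":   np.array([0.9, 0.1, 0.1]),
       "woman": np.array([0.1, 0.1, 0.9]),
       "queen": np.array([0.1, 0.8, 0.9]),
       "apple": np.array([0.5, 0.5, 0.5])}
print(solve_analogy("man", "king", "woman", emb))   # expected: queen
```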