Word Embeddings Flashcards
How does a count-based word embedding differ from the term-term matrix of distributional semantics?
We use a term-document matrix instead.
The meaning of a word u is then the vector
[count(u, d1), count(u, d2), …]
What do we get if we read a term-document matrix column-wise?
A column represents a document: a vector of counts of each vocabulary word in that document.
What do we get if we read a term-document matrix row-wise?
A row represents a word: its vector of counts across all documents, i.e. its embedding.
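A minimal sketch of a toy term-document matrix (the words, documents, and counts are invented for illustration), showing a row read as a word embedding and a column read as a document representation:

```python
import numpy as np

vocab = ["battle", "fool", "wit"]        # rows
docs = ["As You Like It", "Henry V"]     # columns

# counts[i][j] = how often vocab[i] appears in docs[j] (made-up numbers)
counts = np.array([
    [1, 13],   # battle
    [37, 1],   # fool
    [21, 3],   # wit
])

word_vector = counts[vocab.index("fool"), :]    # read a row  -> embedding of "fool"
doc_vector = counts[:, docs.index("Henry V")]   # read a column -> representation of the document
print(word_vector, doc_vector)                  # [37  1] [13  1  3]
```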
Why do we use tf-idf?
Some words are very common but carry little information; tf-idf weights terms so that frequent-but-uninformative words are down-weighted and rarer, more discriminative terms count more.
How do we calculate tf?
tf(t, d) = log_10(1 + count(t, d))
How do we calculate idf?
idf(t) = log_10(N / df(t))
N is the total number of documents
df(t) is the number of documents in which t appears
How do we calculate tf-idf?
tf-idf(t, d) = tf(t, d) * idf(t)
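A sketch of these formulas in Python (the function names and the example numbers are my own):

```python
import math

def tf(count_t_d):
    # tf(t, d) = log10(1 + count(t, d))
    return math.log10(1 + count_t_d)

def idf(N, df_t):
    # idf(t) = log10(N / df(t))
    return math.log10(N / df_t)

def tf_idf(count_t_d, N, df_t):
    return tf(count_t_d) * idf(N, df_t)

# e.g. a term appearing 5 times in a document and in 10 of 1000 documents overall
print(tf_idf(count_t_d=5, N=1000, df_t=10))  # ~0.778 * 2 = ~1.556
```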
What is Laplace smoothing and why do we use it?
Add 1 to every count as a pseudo-count.
This ensures that a word pair that can occur is not assigned zero probability just because the sparse counts never observed it.
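A sketch of add-1 smoothing on a toy count matrix (the counts are invented) before normalising rows into probabilities:

```python
import numpy as np

counts = np.array([
    [2, 0, 1],
    [0, 0, 4],
])

smoothed = counts + 1                                    # add-1 pseudo-count in every cell
probs = smoothed / smoothed.sum(axis=1, keepdims=True)   # row-normalise into probabilities
print(probs)                                             # no zero probabilities remain
```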
What are the benefits of embedding words using a term-document matrix? (3)
Simple and intuitive.
Dimensions are meaningful (each corresponds to a document).
Easy to debug and interpret.
What are the drawbacks of embedding words using a term-document matrix? (2)
Sparse: there may be very many terms and documents, so most counts are zero.
A word's meaning can differ with context, which a single static vector does not capture.
Formalise the challenge of knowing the meaning of a word within its context
Words are represented by vectors.
We model the probability of a word given its context, Pr(w | u1, u2, …).
The task is to find word vectors that maximise Pr(w | u1, u2, …) for each word within its observed contexts.
The objective, maximised over training, is
the sum over every (word, context) pair in the training data of log Pr(w | u1, u2, …); the loss is its negative (the negative log-likelihood).
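A sketch of that objective, assuming a hypothetical model(w, context) function that returns Pr(w | u1, u2, …):

```python
import math

def log_likelihood(model, training_pairs):
    # training_pairs: iterable of (target_word, context_words) from the training data
    # model(w, context) is assumed to return Pr(w | u1, u2, ...)
    total = 0.0
    for w, context in training_pairs:
        total += math.log(model(w, context))
    return total  # training maximises this, or minimises its negative as the loss
```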
How does the Bengio model work?
Predict the next word from the m-1 previous words:
p(w_t | w_{t-1}, …, w_{t-m+1})
How does the CBOW model work?
Predict a word from the context of m words before and m words after
Give the architecture of the CBOW model
Multiple input embeddings, one for each word in the context window.
Take their average.
Project the average onto the vocabulary and apply softmax
to get p(w | context).
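A minimal CBOW sketch in PyTorch; the class name, layer sizes in the usage note, and the use of a linear output projection are my assumptions, not the original word2vec implementation:

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # one embedding per vocabulary word
        self.out = nn.Linear(embed_dim, vocab_size)       # projection back onto the vocabulary

    def forward(self, context_ids):
        # context_ids: (batch, 2m) indices of the m words before and m words after the target
        avg = self.embed(context_ids).mean(dim=1)         # average the context embeddings
        return torch.log_softmax(self.out(avg), dim=-1)   # log p(w | context) over the vocabulary

# usage (sizes are made up):
# model = CBOW(vocab_size=10000, embed_dim=100)
# log_probs = model(torch.tensor([[4, 7, 12, 3]]))        # 2m = 4 context word ids
```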
Give the architecture of the Bengio model
Multiple input embeddings, one for each previous word in the window.
Concatenate them.
Apply a hidden layer with tanh.
Apply softmax over the vocabulary
to get p(w | context).
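A minimal sketch of a Bengio-style feed-forward language model in PyTorch; the names and hidden size are assumptions, and the direct embedding-to-output connections of the original model are omitted:

```python
import torch
import torch.nn as nn

class BengioLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, context_size, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word_ids):
        # prev_word_ids: (batch, context_size) indices of the previous words
        e = self.embed(prev_word_ids)                  # (batch, context_size, embed_dim)
        concat = e.flatten(start_dim=1)                # concatenate the embeddings
        h = torch.tanh(self.hidden(concat))            # tanh hidden layer
        return torch.log_softmax(self.out(h), dim=-1)  # log p(w_t | previous words)

# usage (sizes are made up):
# model = BengioLM(vocab_size=10000, embed_dim=60, context_size=3, hidden_dim=50)
```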