Word Embeddings Flashcards

1
Q

How does a count based word embedding differ from the term-term matrix of distributional semantics?

A

we use a term-document matrix rather than a term-term matrix
the meaning of a word u is then its vector of counts across documents:
[count(u, d1), count(u, d2), …]
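A minimal sketch of building such a term-document matrix; the toy corpus and variable names below are assumptions for illustration, not from the source.

from collections import Counter

docs = ["the cat sat on the mat", "the dog sat", "dogs chase cats"]  # hypothetical toy documents
counts = [Counter(d.split()) for d in docs]                          # per-document term counts
vocab = sorted({w for d in docs for w in d.split()})

# the count-based embedding of word u is [count(u, d1), count(u, d2), ...]
term_doc = {u: [c[u] for c in counts] for u in vocab}
print(term_doc["sat"])   # -> [1, 1, 0]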

2
Q

what do we get if we read a term-document matrix column-wise

A

a column vector that represents a document (its counts for every term)

3
Q

what do we get if we read a term-document matrix row-wise

A

a row vector that represents a word (its counts across every document)
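A small sketch of reading the matrix in both directions; the toy numpy array is an assumed example.

import numpy as np

# rows = terms, columns = documents (toy counts)
M = np.array([[2, 1, 0],    # term 0 in d1, d2, d3
              [0, 1, 3]])   # term 1 in d1, d2, d3

word_vec = M[0, :]   # a row: the vector representing a word
doc_vec  = M[:, 1]   # a column: the vector representing document d2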

4
Q

why do we use tf-idf

A

some words are very common but carry little information (e.g. "the"); tf-idf lets us weight terms so that frequent but uninformative words count for less

5
Q

how do we calculate tf

A

tf(t, d) = log_10(1 + count(t, d))
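As a sketch, the damped term frequency in Python (the function name is assumed):

import math

def tf(count_t_d):
    # tf(t, d) = log10(1 + count(t, d))
    return math.log10(1 + count_t_d)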

6
Q

how do we calculate idf

A

idf(t) = log_10(N / df(t))
N is the total number of documents
df(t) is the number of documents in which t appears
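A matching sketch for idf (the function name is assumed):

import math

def idf(N, df_t):
    # idf(t) = log10(N / df(t)); N documents in total, df_t of them contain t
    return math.log10(N / df_t)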

7
Q

how do we calculate tf idf

A

tf-idf(t, d) = tf(t, d) * idf(t)
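A self-contained sketch combining the two formulas (the example numbers are assumed):

import math

def tf_idf(count_t_d, N, df_t):
    # tf-idf(t, d) = log10(1 + count(t, d)) * log10(N / df(t))
    return math.log10(1 + count_t_d) * math.log10(N / df_t)

# e.g. a term appearing 3 times in d and in 2 of 10 documents:
print(tf_idf(3, N=10, df_t=2))   # log10(4) * log10(5) ≈ 0.42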

8
Q

what is laplace smoothing and why do we use it

A

add 1 to every count as a pseudo-count

this ensures that a word pair which never occurs in the (sparse) data still gets a non-zero probability instead of zero
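A minimal sketch of add-one smoothing on a sparse co-occurrence count matrix (toy counts assumed):

import numpy as np

counts = np.array([[4, 0, 1],
                   [0, 0, 2]])                            # many zero counts due to sparsity
smoothed = counts + 1                                     # add 1 to every entry as a pseudo-count
probs = smoothed / smoothed.sum(axis=1, keepdims=True)    # rows now contain no zero probabilities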

9
Q

what are the benefits of embedding words using a term-document matrix (3)

A

simple and intuitive
dimensions are meaningful
easy to debug and interpret

10
Q

what are the drawbacks of embedding words using a term-document matrix (2)

A

sparse: there may be very many terms and documents, so most entries are zero

a word's meaning may differ depending on its context, but each word gets only one vector

11
Q

Formalise the challenge of knowing the meaning of a word within its context

A

words are represented by vectors
we model the probability of a word given its context, Pr(w | u1, u2, …)

our task is to find word vectors that maximise Pr(w | u1, u2, …) for each word within its observed context

the objective to maximise is
the sum, over every (word, context) pair in the training set, of log Pr(w | u1, u2, …)
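A tiny sketch of this objective, assuming a hypothetical p_model(w, context) that returns Pr(w | u1, u2, …) from the current word vectors:

import math

def log_likelihood(training_pairs, p_model):
    # sum over (word, context) pairs of log Pr(w | context); training chooses vectors to maximise this
    return sum(math.log(p_model(w, context)) for w, context in training_pairs)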

12
Q

How does the Bengio model work?

A

Predict the next word from the m-1 previous words:

p(w_t | w_{t-1}, …, w_{t-m+1})

13
Q

How does the CBOW model work?

A

Predict a word from the context of m words before and m words after

14
Q

Give the architecture of the CBOW model

A

multiple input nodes: one for each context word in the window
average the context word embeddings
project to the vocabulary and apply softmax

to get p(w | context) (sketched below)
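A minimal PyTorch sketch of this CBOW architecture (class name and dimensions are assumptions; real word2vec training adds tricks such as negative sampling):

import torch.nn as nn
import torch.nn.functional as F

class CBOW(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)      # one embedding per word
        self.out = nn.Linear(dim, vocab_size)           # projection to vocabulary scores

    def forward(self, context_ids):                     # (batch, 2m) context word indices
        avg = self.embed(context_ids).mean(dim=1)       # average the context word vectors
        return F.log_softmax(self.out(avg), dim=-1)     # log p(w | context)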

15
Q

Give the architecture of the Bengio model

A

multiple input nodes: one for each of the previous context words
concatenate their embeddings
tanh hidden layer
softmax over the vocabulary

to get p(w | context) (sketched below)
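A minimal PyTorch sketch of the Bengio architecture (names and sizes are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class BengioLM(nn.Module):
    def __init__(self, vocab_size, dim, context_len, hidden):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.hidden = nn.Linear(context_len * dim, hidden)  # acts on the concatenated context
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, prev_ids):                            # (batch, context_len) previous words
        x = self.embed(prev_ids).flatten(start_dim=1)       # concatenate their embeddings
        h = torch.tanh(self.hidden(x))                      # tanh hidden layer
        return F.log_softmax(self.out(h), dim=-1)           # log p(w_t | previous words)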

16
Q

How does the skipgram model work?

A

Predict the context of m words before and m words after given a target word

17
Q

Give the architecture of the skipgram model

A

w(t) as input
project its embedding to the vocabulary and apply softmax
multiple output nodes: one prediction for each context word in the window (sketched below)
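A minimal PyTorch sketch of the skipgram architecture (names and sizes are assumptions; the same output distribution is used for every context position):

import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)       # target word embedding
        self.out = nn.Linear(dim, vocab_size)            # scores over possible context words

    def forward(self, target_ids):                       # (batch,) target word indices
        v = self.embed(target_ids)
        return F.log_softmax(self.out(v), dim=-1)        # log p(context word | target word)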

18
Q

How can we go from CBOW to skipgram

A

Bayes' rule: CBOW models p(w | context) while skipgram models p(context | w), so one can be rewritten in terms of the other via Bayes' rule

19
Q

What is Word2Vec?

A

A skipgram model whose embeddings capture linear relational meaning, such as analogies, and can show semantic change over time
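A small sketch of the analogy (linear relational) property, assuming vectors is a dict of already-trained embeddings (hypothetical data, not a trained model):

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(vectors, a, b, c):
    # "a is to b as c is to ?": the word closest to v(b) - v(a) + v(c)
    target = vectors[b] - vectors[a] + vectors[c]
    return max((w for w in vectors if w not in {a, b, c}),
               key=lambda w: cosine(vectors[w], target))

# e.g. analogy(vectors, "man", "king", "woman") is expected to return "queen"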

20
Q

What are the problems with Word embeddings? (3)

A
  • bias: embeddings inherit biases present in the training text
  • unknown (out-of-vocabulary) words
  • a single vector per word can only capture one context/sense
21
Q

How can we capture more than one context of a word?

A
  • make f(w, c) a continuous function of the context c, as in contextual models like ELMo or BERT (see the sketch below)
  • make f(w, c) discrete with respect to c: word sense disambiguation, with one vector per word sense
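A minimal sketch of the continuous option using a pretrained BERT model via the Hugging Face transformers library (the model choice and helper name are assumptions):

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def contextual_vector(sentence, word):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]       # one vector per token in this sentence
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

# the same word gets different vectors in different contexts:
v1 = contextual_vector("he sat on the river bank", "bank")
v2 = contextual_vector("she deposited money at the bank", "bank")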