Word Embeddings Flashcards
How does a count-based word embedding differ from the term-term matrix of distributional semantics?
We use a term-document matrix instead.
The meaning of a word u is then the vector
[count(u, d1), count(u, d2), …]
What do we get if we read a term-document matrix column-wise?
A column represents a document: a vector of counts of each vocabulary word in that document.
What do we get if we read a term-document matrix row-wise?
A row represents a word: its vector of counts across all documents, i.e. its embedding.
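A minimal sketch of a toy term-document matrix (the words, documents, and counts are invented for illustration), showing a row read as a word embedding and a column read as a document representation:

```python
import numpy as np

vocab = ["battle", "fool", "wit"]        # rows
docs = ["As You Like It", "Henry V"]     # columns

# counts[i][j] = how often vocab[i] appears in docs[j] (made-up numbers)
counts = np.array([
    [1, 13],   # battle
    [37, 1],   # fool
    [21, 3],   # wit
])

word_vector = counts[vocab.index("fool"), :]    # read a row  -> embedding of "fool"
doc_vector = counts[:, docs.index("Henry V")]   # read a column -> representation of the document
print(word_vector, doc_vector)                  # [37  1] [13  1  3]
```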
Why do we use tf-idf?
Some words are very common but carry little information; tf-idf weights terms so that frequent-but-uninformative words are down-weighted and rarer, more discriminative terms count more.
How do we calculate tf?
tf(t, d) = log_10(1 + count(t, d))
How do we calculate idf?
idf(t) = log_10(N / df(t))
N is the total number of documents
df(t) is the number of documents in which t appears
How do we calculate tf-idf?
tf-idf(t, d) = tf(t, d) * idf(t)
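A sketch of these formulas in Python (the function names and the example numbers are my own):

```python
import math

def tf(count_t_d):
    # tf(t, d) = log10(1 + count(t, d))
    return math.log10(1 + count_t_d)

def idf(N, df_t):
    # idf(t) = log10(N / df(t))
    return math.log10(N / df_t)

def tf_idf(count_t_d, N, df_t):
    return tf(count_t_d) * idf(N, df_t)

# e.g. a term appearing 5 times in a document and in 10 of 1000 documents overall
print(tf_idf(count_t_d=5, N=1000, df_t=10))  # ~0.778 * 2 = ~1.556
```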
What is Laplace smoothing and why do we use it?
Add 1 to every count as a pseudo-count.
This ensures that a word pair that can occur is not assigned zero probability just because the sparse counts never observed it.
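A sketch of add-1 smoothing on a toy count matrix (the counts are invented) before normalising rows into probabilities:

```python
import numpy as np

counts = np.array([
    [2, 0, 1],
    [0, 0, 4],
])

smoothed = counts + 1                                    # add-1 pseudo-count in every cell
probs = smoothed / smoothed.sum(axis=1, keepdims=True)   # row-normalise into probabilities
print(probs)                                             # no zero probabilities remain
```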
What are the benefits of embedding words using a term-document matrix? (3)
Simple and intuitive.
Dimensions are meaningful (each corresponds to a document).
Easy to debug and interpret.
What are the drawbacks of embedding words using a term-document matrix? (2)
Sparse: there may be very many terms and documents, so most counts are zero.
A word's meaning can differ with context, which a single static vector does not capture.
Formalise the challenge of knowing the meaning of a word within its context
Words are represented by vectors.
We model the probability of a word given its context, Pr(w | u1, u2, …).
The task is to find word vectors that maximise Pr(w | u1, u2, …) for each word within its observed contexts.
The objective, maximised over training, is
the sum over every (word, context) pair in the training data of log Pr(w | u1, u2, …); the loss is its negative (the negative log-likelihood).
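A sketch of that objective, assuming a hypothetical model(w, context) function that returns Pr(w | u1, u2, …):

```python
import math

def log_likelihood(model, training_pairs):
    # training_pairs: iterable of (target_word, context_words) from the training data
    # model(w, context) is assumed to return Pr(w | u1, u2, ...)
    total = 0.0
    for w, context in training_pairs:
        total += math.log(model(w, context))
    return total  # training maximises this, or minimises its negative as the loss
```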
How does the Bengio model work?
Predict the next word from the m-1 previous words:
p(w_t | w_{t-1}, …, w_{t-m+1})
How does the CBOW model work?
Predict a word from the context of m words before and m words after
Give the architecture of the CBOW model
Multiple input embeddings, one for each word in the context window.
Take their average.
Project the average onto the vocabulary and apply softmax
to get p(w | context).
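A minimal CBOW sketch in PyTorch; the class name, layer sizes in the usage note, and the use of a linear output projection are my assumptions, not the original word2vec implementation:

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # one embedding per vocabulary word
        self.out = nn.Linear(embed_dim, vocab_size)       # projection back onto the vocabulary

    def forward(self, context_ids):
        # context_ids: (batch, 2m) indices of the m words before and m words after the target
        avg = self.embed(context_ids).mean(dim=1)         # average the context embeddings
        return torch.log_softmax(self.out(avg), dim=-1)   # log p(w | context) over the vocabulary

# usage (sizes are made up):
# model = CBOW(vocab_size=10000, embed_dim=100)
# log_probs = model(torch.tensor([[4, 7, 12, 3]]))        # 2m = 4 context word ids
```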
Give the architecture of the Bengio model
Multiple input embeddings, one for each previous word in the window.
Concatenate them.
Apply a hidden layer with tanh.
Apply softmax over the vocabulary
to get p(w | context).
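A minimal sketch of a Bengio-style feed-forward language model in PyTorch; the names and hidden size are assumptions, and the direct embedding-to-output connections of the original model are omitted:

```python
import torch
import torch.nn as nn

class BengioLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, context_size, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word_ids):
        # prev_word_ids: (batch, context_size) indices of the previous words
        e = self.embed(prev_word_ids)                  # (batch, context_size, embed_dim)
        concat = e.flatten(start_dim=1)                # concatenate the embeddings
        h = torch.tanh(self.hidden(concat))            # tanh hidden layer
        return torch.log_softmax(self.out(h), dim=-1)  # log p(w_t | previous words)

# usage (sizes are made up):
# model = BengioLM(vocab_size=10000, embed_dim=60, context_size=3, hidden_dim=50)
```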