w8l2 Flashcards
what is the core idea of beam search
keep track of the k most probable partial translations
k is the beam size (typically around 5 or 10)
beam search is not guaranteed to find the optimal solution, but it is efficient
we search for high-scoring hypotheses
what are the beam search stopping criteria
wait until EOS (end of sentence) is generated
or until we reach a pre-established time step T (a sketch of the full procedure follows)
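A minimal Python sketch of the procedure in these two cards, assuming a hypothetical step_log_probs(tokens) callable that returns (next-token, log-probability) pairs; names and default values are illustrative, not from the lecture.

```python
# Minimal beam search sketch. step_log_probs is an assumed model interface,
# not a real library call: it maps a partial hypothesis to (token, log_prob) pairs.

def beam_search(step_log_probs, bos, eos, k=5, max_steps=50):
    """Keep only the k most probable partial translations at every step."""
    beams = [([bos], 0.0)]          # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_steps):      # stopping criterion 2: pre-established time limit
        candidates = []
        for tokens, score in beams:
            for tok, logp in step_log_probs(tokens):
                candidates.append((tokens + [tok], score + logp))
        # Prune to the k highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:
            if tokens[-1] == eos:   # stopping criterion 1: EOS generated
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:               # every surviving hypothesis has ended
            break
    finished.extend(beams)          # include hypotheses cut off by the time limit
    # Best hypothesis found; efficient, but not guaranteed to be the global optimum.
    return max(finished, key=lambda c: c[1])
```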
what is the attention input
h_1, …, h_N
all encoder hidden states
the decoder hidden state s_t at time step t
what are the attention scores
score(s_t, h_k), k = 1…N
how relevant is source token k for target step t
what are attention weights
we apply a softmax to the scores so that we get a probability distribution over them
so they add up to one
attention output
a weighted sum: multiply each attention weight by its encoder hidden state h_k
do this for every hidden state and add them all up
that sum is the attention output for this decoder step (see the NumPy sketch below)
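Combining the three attention cards, here is a minimal NumPy sketch of one decoder step of (dot-product) attention; array names and shapes are assumptions for illustration.

```python
import numpy as np

def attention_step(s_t, H):
    """s_t: decoder hidden state, shape (d,); H: encoder hidden states h_1..h_N, shape (N, d)."""
    scores = H @ s_t                          # score(s_t, h_k) for k = 1..N (dot product)
    weights = np.exp(scores - scores.max())   # softmax: a probability distribution...
    weights /= weights.sum()                  # ...that adds up to one
    output = weights @ H                      # weighted sum of the encoder hidden states
    return output, weights
```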
what are some ways to compute the attention score
dot-product attention
multiplicative attention
additive attention (each is sketched below)
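Illustrative NumPy versions of the three scoring functions; W, W1, W2 and v stand for learned parameters whose shapes are assumed here, not given in the lecture.

```python
import numpy as np

def dot_product_score(s_t, h_k):
    return s_t @ h_k                            # requires s_t and h_k to have the same size

def multiplicative_score(s_t, h_k, W):
    return s_t @ W @ h_k                        # bilinear form with a learned matrix W

def additive_score(s_t, h_k, W1, W2, v):
    return v @ np.tanh(W1 @ s_t + W2 @ h_k)     # small feed-forward net over both vectors
```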
what inspired the transformer
we need an architecture that provides contextual embeddings
captures semantic and syntactic information like RNNs
can process a sentence in parallel
and is cheap per layer
seq2seq without attention uses what to process within the encoder, within the decoder, and for the decoder-encoder interaction
RNN, RNN, and a static fixed-size vector, respectively
seq2seq with attention uses what to process within the encoder, within the decoder, and for the decoder-encoder interaction
RNN, RNN, and attention, respectively
what does self-attention consist of
● Query (q): the vector from which the attention is looking
● Key (k): the vector at which the query looks to establish context
● Value (v): the value of the word being looked at, weighted based on context
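A single-head scaled dot-product self-attention sketch in NumPy; the projection matrices Wq, Wk, Wv are assumed learned parameters, and the 1/sqrt(d_k) scaling follows the standard Transformer formulation rather than anything stated on this card.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: token embeddings, shape (n, d); Wq/Wk/Wv: projection matrices, shape (d, d_k)."""
    Q = X @ Wq                                        # queries: where each token looks from
    K = X @ Wk                                        # keys: what each token offers to be looked at
    V = X @ Wv                                        # values: the content that gets mixed together
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # relevance of every key to every query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax over the keys
    return weights @ V                                # each output row is a context-weighted sum of values
```

Every output row is computed from the whole sentence at once, which is why self-attention can process a sentence in parallel, unlike an RNN.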