Week 8 - seq2seq learning Flashcards
What is the task of seq2seq
transforming one sequence (input) into another sequence (output)
the input and output do not have to have the same length
What is the task of MT
take a sequence written in one language and translate it into another sequence in another language
input: source language sequence
output: target language sequence
What is greedy decoding
Used in the deep learning approach to MT
chooses the token with the highest probability from the softmax distribution
at each timestep t, the decoder chooses the apparent local optimum
however, this is not necessarily the globally optimal token
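A minimal sketch of greedy decoding, assuming a hypothetical decoder_step(prefix) helper that returns a softmax distribution (numpy array) over the vocabulary for the next token:

```python
import numpy as np

def greedy_decode(decoder_step, sos_id, eos_id, max_len=50):
    """Greedy decoding: at each timestep pick the single highest-probability token."""
    prefix = [sos_id]
    for _ in range(max_len):
        probs = decoder_step(prefix)      # softmax over next tokens (assumed helper)
        next_id = int(np.argmax(probs))   # the apparent local optimum at this step
        prefix.append(next_id)
        if next_id == eos_id:             # stop once </s> is generated
            break
    return prefix
```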
What is the search tree in MT
graphical representation of the token choices available at each decoding timestep
each branch has a probability score; a greedy decoder follows the highest-scoring branch at each step
the globally optimal path is found by multiplying the probabilities along each complete path and taking the path with the highest product
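A tiny worked example with made-up branch probabilities, showing that the greedy path is not always the globally best path:

```python
# hypothetical two-step search tree probabilities
greedy_path = 0.6 * 0.2   # greedy picks the 0.6 branch first, but its best child is only 0.2 -> 0.12
other_path  = 0.4 * 0.5   # the 0.4 branch leads to a 0.5 child -> 0.20, the higher (global) product
```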
What is Beam search
A solution to the local-optimum problem of greedy decoding
selects k possible tokens at each timestep
eg k =2; only follows the 2 highest probability branches
what is k in beam search
beam width parameter
How does beam search work
1) select k best options (hypotheses) EG “the” “arrived”
2) pass each hypothesis (as input) through the decoder and softmax to obtain a distribution over the next possible tokens
3) score each hypothesis
4) based on these scores, select the next k best options for the next timestep EG “witch” “green” …2 more
5) pass these options, joined to their previous tokens eg. “the witch” “the green”, into the decoder to obtain the next softmax
6) repeat; each time a </s> is generated that hypothesis is complete, k is reduced by 1, and the search continues until k=0
branches that are not selected are terminated (see the sketch below)
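A compact beam-search sketch under the same assumption as the greedy example (a hypothetical decoder_step(prefix) returning a softmax over the vocabulary), keeping the k best hypotheses scored by summed log probability:

```python
import numpy as np

def beam_search(decoder_step, sos_id, eos_id, k=2, max_len=50):
    """Keep the k best partial hypotheses; reduce k as hypotheses generate </s>."""
    beams = [([sos_id], 0.0)]              # (prefix, summed log probability)
    completed = []
    for _ in range(max_len):
        if not beams:                      # k has effectively reached 0
            break
        candidates = []
        for prefix, score in beams:
            probs = decoder_step(prefix)               # softmax over next tokens
            for tok in np.argsort(probs)[-k:]:         # k best continuations of this hypothesis
                candidates.append((prefix + [int(tok)],
                                   score + float(np.log(probs[tok]))))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:k - len(completed)]:  # unselected branches are terminated
            if prefix[-1] == eos_id:
                completed.append((prefix, score))      # </s> generated: hypothesis is complete
            else:
                beams.append((prefix, score))
    return completed + beams
```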
What is the probability of a (partial) translation
At each timestep, add the log probability of the next token to the log probability of the translation so far
(sum of log probabilities)
score(y1, y2, y3) = log P(y1|x) + log P(y2|y1, x) + log P(y3|y2, y1, x)
note: log probabilities are negative values (the log of a probability below 1 is negative)
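For example (made-up probabilities), the score of a three-token partial translation is the running sum of token log probabilities:

```python
import math

# hypothetical per-token probabilities P(y1|x), P(y2|y1,x), P(y3|y2,y1,x)
token_probs = [0.5, 0.4, 0.8]
score = sum(math.log(p) for p in token_probs)
print(score)   # ≈ -1.83: negative, since each log probability is below 0
```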
MT transformer-based encoder-decoder
In the encoder
Uses Vaswani transformer blocks
allows access to tokens both to the left and to the right of the current token
In the decoder
uses an additional cross attention layer
causal self attention - can only access tokens to the left
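A small numpy illustration (assumed, not lecture code) of how causal self-attention blocks access to tokens on the right: positions above the diagonal are masked out before the softmax:

```python
import numpy as np

T = 4
scores = np.random.randn(T, T)                             # raw self-attention scores (query x key)
causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)    # True for keys to the right of each query
scores[causal_mask] = -np.inf                              # masked scores become 0 after softmax
# each position can now only attend to itself and tokens to its left
```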
What does the cross attention layer allow (MT)
(in decoder)
Allows the decoder to attend to each of the source language tokens as projected into the final layer of the encoder
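A minimal single-head cross-attention sketch (numpy; the weight matrices and dimensions are illustrative assumptions): queries come from the decoder, keys and values from the encoder's final-layer outputs:

```python
import numpy as np

def cross_attention(dec_states, enc_outputs, Wq, Wk, Wv):
    """dec_states: (T_dec, d) decoder hidden states; enc_outputs: (T_src, d) final encoder layer."""
    Q = dec_states @ Wq                               # queries from the decoder
    K = enc_outputs @ Wk                              # keys from the encoder
    V = enc_outputs @ Wv                              # values from the encoder
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # (T_dec, T_src) attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the source tokens
    return weights @ V                                # each decoder position mixes source information
```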
What is BLEU
BiLingual Evaluation Understudy
- precision based metric that uses word overlap
- calculated for each translated sequence (averaged over a corpus to report overall performance)
Why is BLEU better than precision
Precision will give a perfect score to
reference: the cat is on the mat
hypothesis: the the the the the the
because the hypothesis does not contain any words that are not in the reference
How does BLEU work
For each word in the hypothesis
take the minimum of how many times it appears in the hypothesis and how many times it appears in the reference
sum all these minimums
divide by the total number of words in the hypothesis
eg Hypothesis = the the the the the the
reference = the cat is on the mat
BLEU = min(6,2)/6 = 2/6 = 1/3
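A small sketch of this clipped unigram count (whitespace tokenisation assumed):

```python
from collections import Counter

def bleu_unigram(hypothesis, reference):
    hyp = hypothesis.split()
    hyp_counts, ref_counts = Counter(hyp), Counter(reference.split())
    clipped = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    return clipped / len(hyp)

print(bleu_unigram("the the the the the the", "the cat is on the mat"))   # min(6,2)/6 = 1/3
```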
BLEU-N
based on n-grams
1) generate n-grams for each of the hyp and ref
2) for each hypothesis n-gram, take the minimum of (count in hyp, count in ref)
3) BLEU-N = (sum of these minimum values) / (number of n-grams in the hypothesis)
eg HYP = the cat is here
so bigrams = “the cat” “cat is” “is here”
NOTE only calculate over all hyp n-grams (not ref)
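A sketch of BLEU-N for a single n-gram order, following the card's definition (whitespace tokenisation assumed):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_n(hypothesis, reference, n=2):
    hyp_ngrams = ngrams(hypothesis.split(), n)
    hyp_counts = Counter(hyp_ngrams)
    ref_counts = Counter(ngrams(reference.split(), n))
    clipped = sum(min(count, ref_counts[g]) for g, count in hyp_counts.items())
    return clipped / len(hyp_ngrams)       # only hypothesis n-grams in the denominator

print(bleu_n("the cat is here", "the cat is on the mat", n=2))
# bigrams "the cat", "cat is", "is here": 2 of the 3 occur in the reference -> 2/3
```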
What is chrF
Character F-score
- based on a function of the number of character n-gram overlaps between a hypothesis and reference translation
- uses a parameter k = maximum length of character n-grams to be considered
eg k=2 means we are interested in unigrams and bigrams only
How to calculate chrP
the proportion of character 1-grams up to k-grams in the hypothesis that also occur in the reference, averaged over the n-gram lengths
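A sketch of chrP under this definition (character n-grams taken over the raw strings, counts clipped as in BLEU; both details are assumptions rather than the card's exact wording):

```python
from collections import Counter

def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def chrP(hypothesis, reference, k=2):
    """Average over n = 1..k of the fraction of hypothesis character n-grams found in the reference."""
    precisions = []
    for n in range(1, k + 1):
        hyp_counts = Counter(char_ngrams(hypothesis, n))
        ref_counts = Counter(char_ngrams(reference, n))
        matched = sum(min(count, ref_counts[g]) for g, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        precisions.append(matched / total if total else 0.0)
    return sum(precisions) / len(precisions)

print(round(chrP("the cat is here", "the cat is on the mat", k=2), 3))
```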