Week 8 - seq2seq learning Flashcards

1
Q

What is the task of seq2seq

A

transforming one sequence (input) into another sequence (output)
the input and output do not have to have the same length

2
Q

What is the task of MT

A

take a sequence written in one language and translate it into a sequence in another language
input: source language sequence
output: target language sequence

3
Q

What is greedy decoding

A

Used in the deep learning approach to MT
chooses the token with the highest probability from the softmax distribution
at each timestep t, the decoder chooses the apparent local optimum
however, this is not necessarily the globally optimal token
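
A minimal Python sketch of greedy decoding, assuming a hypothetical decoder_step(tokens) function that returns a softmax distribution over the vocabulary (decoder_step, start_token and end_token are illustrative names, not from the card):

import numpy as np

def greedy_decode(decoder_step, start_token, end_token, max_len=50):
    tokens = [start_token]
    for _ in range(max_len):
        probs = decoder_step(tokens)        # softmax distribution over the vocabulary
        next_token = int(np.argmax(probs))  # pick the apparent local optimum at this timestep
        tokens.append(next_token)
        if next_token == end_token:
            break
    return tokens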

4
Q

What is the search tree in MT

A

graphical representation of the choices made by a greedy decoder
each branch is labelled with a probability score; the greedy decoder follows the highest-scoring branch
the globally optimal path can be found by multiplying the probabilities along each full path and taking the highest product

5
Q

What is Beam search

A

A solution to the local-optimum problem of greedy decoding
keeps k possible tokens (hypotheses) at each timestep
e.g. k = 2: only the 2 highest-probability branches are followed

6
Q

what is k in beam search

A

beam width parameter

7
Q

How does beam search work

A

1) select the k best options (hypotheses), e.g. “the”, “arrived”
2) pass each hypothesis (as input) through the decoder and softmax to obtain a distribution over the next possible tokens
3) score each hypothesis
4) based on these scores, select the k best options for the next timestep, e.g. “witch”, “green”, …2 more
5) pass these extended hypotheses back into the decoder, e.g. “the witch”, “the green”, to obtain the next softmax
6) repeat; each time a </s> is generated, k is reduced by 1 and the search continues until k = 0

branches that are not selected are terminated
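
A minimal Python sketch of this procedure, assuming the same hypothetical decoder_step(tokens) function that returns a softmax distribution (all names are illustrative, not from the card):

import numpy as np

def beam_search(decoder_step, start_token, end_token, k=2, max_len=50):
    beams = [([start_token], 0.0)]                     # (hypothesis tokens, sum of log probabilities)
    finished = []
    while beams and k > 0:
        candidates = []
        for tokens, score in beams:
            log_probs = np.log(decoder_step(tokens))   # softmax over the next possible tokens
            for tok in np.argsort(log_probs)[-k:]:     # k best continuations of this hypothesis
                candidates.append((tokens + [int(tok)], score + float(log_probs[tok])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:           # keep the k best hypotheses overall
            if tokens[-1] == end_token or len(tokens) >= max_len:
                finished.append((tokens, score))       # </s> generated: this beam stops, k is reduced
                k -= 1
            else:
                beams.append((tokens, score))          # unselected branches are simply dropped
    return max(finished, key=lambda c: c[1])           # best completed hypothesis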

8
Q

What is the probability of a (partial) translation

A

At each timestep, add the log probability of the translation so far to the log probability of the next token
(a sum of log probabilities), e.g.
score(y1 y2 y3) = log P(y1|x) + log P(y2|y1, x) + log P(y3|y2, y1, x)
note: log probabilities are negative (or zero)
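
A tiny worked example in Python, with made-up per-step probabilities purely for illustration:

import math

# hypothetical values: P(y1|x) = 0.4, P(y2|y1,x) = 0.5, P(y3|y2,y1,x) = 0.25
step_probs = [0.4, 0.5, 0.25]
score = sum(math.log(p) for p in step_probs)   # log 0.4 + log 0.5 + log 0.25
print(round(score, 3))                         # -2.996 = log(0.05); log probabilities are <= 0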

9
Q

MT transformer-based encoder-decoder

A

In the encoder
uses standard (Vaswani et al.) transformer blocks
self-attention can access tokens to the left and right of the current token

In the decoder
uses an additional cross-attention layer
causal self-attention: can only attend to tokens to the left

10
Q

What does the cross attention layer allow (MT)

A

(in decoder)
Allows the decoder to attend to each of the source-language tokens as projected into the final layer of the encoder
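
A shape-level Python sketch of scaled dot-product cross-attention; the learned query/key/value projection matrices are omitted for brevity (an assumption of this sketch), so it only illustrates where the decoder and encoder states enter:

import numpy as np

def cross_attention(decoder_states, encoder_outputs):
    # queries from decoder states; keys and values from the encoder's final-layer outputs
    scores = decoder_states @ encoder_outputs.T / np.sqrt(decoder_states.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax over source tokens
    return weights @ encoder_outputs    # each decoder position attends to every source token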

11
Q

What is BLEU

A

BiLingual Evaluation Understudy
- a precision-based metric that uses word overlap
- calculated for each translated sequence (and averaged over a corpus to report overall performance)

12
Q

Why is BLEU better than precision

A

Plain (unclipped) precision gives a perfect score to
reference: the cat is on the mat
hypothesis: the the the the the the
because every word in the hypothesis also appears in the reference

13
Q

How does BLEU work

A

For each word,
take the minimum of how many times the word appears in the hypothesis and in the reference,
sum these minimums,
and divide by the total number of words in the hypothesis

e.g. hypothesis = the the the the the the
reference = the cat is on the mat
BLEU = min(6, 2) / 6 = 2/6 = 1/3
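
The same worked example as a Python sketch of clipped unigram precision only (ignoring BLEU's brevity penalty and higher-order n-grams):

from collections import Counter

hyp = "the the the the the the".split()
ref = "the cat is on the mat".split()
hyp_counts, ref_counts = Counter(hyp), Counter(ref)

# for each hypothesis word: min(count in hyp, count in ref), summed, then divided by |hyp|
clipped = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
print(clipped / len(hyp))   # min(6, 2) / 6 = 2/6 ≈ 0.333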

14
Q

BLEU-N

A

based on n-grams
1) generate the n-grams for both the hypothesis and the reference
2) for each n-gram, take min(count in hyp, count in ref)
3) BLEU-N = sum of these minimums / number of n-grams in the hypothesis

e.g. HYP = the cat is here
so its bigrams are “the cat”, “cat is”, “is here”

NOTE: only calculated over the hypothesis n-grams (not the reference's)
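
A short Python sketch of this n-gram version (again without BLEU's brevity penalty or the geometric mean over n-gram orders used in full corpus-level BLEU):

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_n(hypothesis, reference, n):
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())   # min(count in hyp, count in ref)
    return clipped / max(sum(hyp_counts.values()), 1)                     # divide by hypothesis n-grams only

print(bleu_n("the cat is here", "the cat is on the mat", 2))   # "the cat" and "cat is" match: 2/3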

15
Q

What is chrF

A

Character F-score
- based on a function of the number of character n-gram overlaps between a hypothesis and reference translation
- uses a parameter k = maximum length of character n-grams to be considered

eg k=2 means we are interested in unigrams and bigrams only

16
Q

How to calculate chrP

A

the proportion of character 1- to k-grams in the hypothesis that also occur in the reference, averaged over the n-gram lengths

17
Q

How to calculate chrR

A

the proportion of character 1- to k-grams in the reference that also occur in the hypothesis, averaged over the n-gram lengths

18
Q

How to calculate chrFβ

A

1) Remove spaces from both the reference and the hypothesis
2) count the number of matching character n-grams (for n = 1 to k)
3) plug these counts into chrP and chrR

β is typically 2
chrFβ = (1 + β²) x chrP x chrR / (β² x chrP + chrR)
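
A minimal Python sketch of chrF under these definitions (spaces removed as above, no smoothing):

from collections import Counter

def char_ngrams(text, n):
    chars = text.replace(" ", "")                    # 1) remove spaces
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hypothesis, reference, k=2, beta=2):
    precisions, recalls = [], []
    for n in range(1, k + 1):                        # character n-grams for n = 1..k
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())          # 2) matching n-grams
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    chr_p, chr_r = sum(precisions) / k, sum(recalls) / k   # 3) averaged chrP and chrR
    if chr_p + chr_r == 0:
        return 0.0
    return (1 + beta**2) * chr_p * chr_r / (beta**2 * chr_p + chr_r)

print(round(chrf("the cat is here", "the cat is on the mat"), 3))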

19
Q

What does chrF2,3 notation mean

A

β = 2
k = 3

20
Q

What is the task of ATS

A
  • produce a summary of a full-length document
  • input sequence: full-length text (source)
  • output sequence: summarised text (target)
21
Q

4 types of ATS

A

input: single or multi doc

language: mono-, multi-, or cross-lingual

learning: supervised or unsupervised

generation: extractive vs abstractive

22
Q

What are the statistical approaches of Extractive ATS

A

word-frequency based
TF-IDF based

23
Q

what are machine learning based approaches to extractive ATS

A

binary classifiers
graph-based

24
Q

what does the decoder act as in ATS

A

extractor

25
Q

How does attentional encoder-decoder RNN for ATS work

A

Has the generic encoder-decoder mechanism, with an attention layer from the encoder hidden states to each decoder state
feature-rich encoder: linguistic features combined with word embeddings
switching generator/pointer model

26
Q

what is the purpose of a feature rich encoder

A

improves the performance of text summarisation by capturing linguistic features

27
Q

What is the switching generator/pointer model

A

issue with ATS: there will be words in the test data that the model has not seen before (OOV words)

At any point during decoding, compute a probability (the switch) that decides whether to generate or to point

Instead of emitting <UNK> for OOV words, the model points to the word's position in the input document

28
Q

How do transformers work for ATS

A

repurposes the training used for LMs (teacher forcing, predicting the next word)

During inference (after training)
the source (full-length document), with a separator token δ appended, is used as input
the target is generated in an autoregressive manner

the model has access to the source as well as the target tokens generated so far

29
Q

What is ROUGE

A

Automatic evaluation for summaries
Recall-Oriented Understudy for Gisting Evaluation

Counts the number of overlapping units (i.e. n-grams) between the generated (candidate) and reference summaries

30
Q

How to calculate ROUGE-N

A

n-gram recall
ROUGE-N = Σ count_match(gram_n) / Σ count(gram_n)

(the denominator Σ count(gram_n) is the number of n-grams in the reference summary; count_match counts the n-grams that also appear in the candidate)
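
A minimal Python sketch of ROUGE-N for a single candidate/reference pair (the full metric sums over all reference summaries):

from collections import Counter

def rouge_n(candidate, reference, n):
    def grams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    matches = sum((cand & ref).values())             # overlapping n-grams
    return matches / max(sum(ref.values()), 1)       # recall: divide by reference n-grams

print(rouge_n("the cat sat", "the cat is on the mat", 1))   # 2/6 ≈ 0.333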

31
Q

What is ROUGE-L

A

The Longest Common Subsequence (LCS)-based F-score

subsequences are ordered but can have gaps (other words in between)

assuming X and Y are the reference and candidate summaries, of lengths m and n tokens respectively

precision and recall -> F-score

32
Q

How to calculate Rlcs

A

LCS (X,Y) / m

m = length of reference summary

33
Q

How to calculate Plcs

A

LCS (X,Y) / n

n = length of candidate summary

34
Q

How to calculate Flcs

A

Flcs = (1 + β²) Rlcs Plcs / (Rlcs + β² Plcs)

(β = 1 for the F1 score)
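
A minimal Python sketch of ROUGE-L from these definitions, using a standard dynamic-programming LCS (sentence-level only):

def lcs_length(x, y):
    # dynamic-programming longest common subsequence over token lists
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if xi == yj else max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(reference, candidate, beta=1):
    x, y = reference.split(), candidate.split()      # X (length m), Y (length n)
    lcs = lcs_length(x, y)
    r, p = lcs / len(x), lcs / len(y)                # Rlcs = LCS/m, Plcs = LCS/n
    return 0.0 if r + p == 0 else (1 + beta**2) * r * p / (r + beta**2 * p)

print(round(rouge_l("the cat is on the mat", "the cat sat on a mat"), 3))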

35
Q

What is ROUGE-S

A

based on skip-bigram co-occurrences

again assuming X, Y = reference summary, candidate summary

again calculate precision and recall -> F-score

36
Q

How to calculate Rskip2

A

SKIP2(X, Y) / C(m, 2)

(the number of matching skip-bigrams in X and Y, divided by the total number of skip-bigrams in the reference)

37
Q

How to calculate Pskip2

A

SKIP2(X, Y) / C(n, 2)

(divided by the total number of skip-bigrams in the candidate)

38
Q

How to calculate Fskip2

A

Fskip2 = (1 + β²) Rskip2 Pskip2 / (Rskip2 + β² Pskip2)

39
Q

What are skip bigrams

A

pairs of words in sentence order that allow words to be skipped in between
“the quick brown fox”

example skip-bigrams: “the brown”, “the fox”
pairs with no skips are also included, e.g. “brown fox”
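
A minimal Python sketch of ROUGE-S built from these definitions (duplicate tokens collapsed for simplicity):

from itertools import combinations
from math import comb

def skip_bigrams(text):
    # every ordered pair of tokens, any number of skipped words in between
    return set(combinations(text.split(), 2))

def rouge_s(reference, candidate, beta=1):
    m, n = len(reference.split()), len(candidate.split())
    matches = len(skip_bigrams(reference) & skip_bigrams(candidate))   # SKIP2(X, Y)
    r, p = matches / comb(m, 2), matches / comb(n, 2)                  # Rskip2, Pskip2
    return 0.0 if r + p == 0 else (1 + beta**2) * r * p / (r + beta**2 * p)

print(sorted(skip_bigrams("the quick brown fox")))   # 6 pairs, including ("the", "fox")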

40
Q

ROUGE-S vs ROUGE-L

A

ROUGE-L does not handle the case of multiple long subsequences (it only finds the longest one)
ROUGE-S handles this better

41
Q

What is autoregressive generation

A

generation of the output sequentially, where each output token depends on the previously generated tokens
e.g. in ATS, in MT, with attention
every time, the decoder output is fed back in as input
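
A minimal Python sketch of this loop, assuming a hypothetical model(source, target_so_far) that returns the next-token distribution (the names are illustrative, not a real API):

import numpy as np

def generate(model, source, start_token, end_token, max_len=100):
    target = [start_token]
    while target[-1] != end_token and len(target) < max_len:
        probs = model(source, target)          # conditioned on the source and the target so far
        target.append(int(np.argmax(probs)))   # the output token is fed back in as the next input
    return target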