Week 8 - seq2seq learning Flashcards
What is the task of seq2seq
transforming one sequence (input) into another sequence (output)
the input and output do not have to have the same length
What is the task of MT
take a sequence written in one language and translate it into another sequence in another language
input: source language sequence
output: target language sequence
What is greedy decoding
Used in the deep learning approach to MT
chooses the token with the highest probability from the softmax distribution
at each timestep t, the decoder chooses the apparent local optimum
however, this is not necessarily the globally optimal token
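A minimal sketch of greedy decoding, assuming a hypothetical decoder_step(prefix) helper that returns a softmax distribution (numpy array) over the vocabulary for the next token:

```python
import numpy as np

def greedy_decode(decoder_step, sos_id, eos_id, max_len=50):
    """Greedy decoding: at each timestep pick the single highest-probability token."""
    prefix = [sos_id]
    for _ in range(max_len):
        probs = decoder_step(prefix)      # softmax over next tokens (assumed helper)
        next_id = int(np.argmax(probs))   # the apparent local optimum at this step
        prefix.append(next_id)
        if next_id == eos_id:             # stop once </s> is generated
            break
    return prefix
```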
What is the search tree in MT
graphical representation of the token choices available at each decoding timestep
each branch has a probability score; a greedy decoder follows the highest-scoring branch at each step
the globally optimal path is found by multiplying the probabilities along each complete path and taking the path with the highest product
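A tiny worked example with made-up branch probabilities, showing that the greedy path is not always the globally best path:

```python
# hypothetical two-step search tree probabilities
greedy_path = 0.6 * 0.2   # greedy picks the 0.6 branch first, but its best child is only 0.2 -> 0.12
other_path  = 0.4 * 0.5   # the 0.4 branch leads to a 0.5 child -> 0.20, the higher (global) product
```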
What is Beam search
A solution to the local-optimum problem of greedy decoding
selects k possible tokens at each timestep
eg k =2; only follows the 2 highest probability branches
what is k in beam search
beam width parameter
How does beam search work
1) select k best options (hypotheses) EG “the” “arrived”
2) pass each hypothesis (as input) through the decoder and softmax to obtain a distribution over the next possible tokens
3) score each hypothesis
4) based on these scores, select the next k best options for the next timestep EG “witch” “green” …2 more
5) pass these options, joined to their previous tokens eg. “the witch” “the green”, into the decoder to obtain the next softmax
6) repeat; each time a </s> is generated that hypothesis is complete, k is reduced by 1, and the search continues until k=0
branches that are not selected are terminated (see the sketch below)
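A compact beam-search sketch under the same assumption as the greedy example (a hypothetical decoder_step(prefix) returning a softmax over the vocabulary), keeping the k best hypotheses scored by summed log probability:

```python
import numpy as np

def beam_search(decoder_step, sos_id, eos_id, k=2, max_len=50):
    """Keep the k best partial hypotheses; reduce k as hypotheses generate </s>."""
    beams = [([sos_id], 0.0)]              # (prefix, summed log probability)
    completed = []
    for _ in range(max_len):
        if not beams:                      # k has effectively reached 0
            break
        candidates = []
        for prefix, score in beams:
            probs = decoder_step(prefix)               # softmax over next tokens
            for tok in np.argsort(probs)[-k:]:         # k best continuations of this hypothesis
                candidates.append((prefix + [int(tok)],
                                   score + float(np.log(probs[tok]))))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:k - len(completed)]:  # unselected branches are terminated
            if prefix[-1] == eos_id:
                completed.append((prefix, score))      # </s> generated: hypothesis is complete
            else:
                beams.append((prefix, score))
    return completed + beams
```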
What is the probability of a (partial) translation
At each timestep, add the log probability of the next token to the log probability of the translation so far
(sum of log probabilities)
score(y1, y2, y3) = log P(y1|x) + log P(y2|y1, x) + log P(y3|y2, y1, x)
note: log probabilities are negative values (the log of a probability below 1 is negative)
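For example (made-up probabilities), the score of a three-token partial translation is the running sum of token log probabilities:

```python
import math

# hypothetical per-token probabilities P(y1|x), P(y2|y1,x), P(y3|y2,y1,x)
token_probs = [0.5, 0.4, 0.8]
score = sum(math.log(p) for p in token_probs)
print(score)   # ≈ -1.83: negative, since each log probability is below 0
```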
MT transformer-based encoder-decoder
In the encoder
Uses Vaswani transformer blocks
allows access to tokens both to the left and to the right of the current token
In the decoder
uses an additional cross attention layer
causal self attention - can only access tokens to the left
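A small numpy illustration (assumed, not lecture code) of how causal self-attention blocks access to tokens on the right: positions above the diagonal are masked out before the softmax:

```python
import numpy as np

T = 4
scores = np.random.randn(T, T)                             # raw self-attention scores (query x key)
causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)    # True for keys to the right of each query
scores[causal_mask] = -np.inf                              # masked scores become 0 after softmax
# each position can now only attend to itself and tokens to its left
```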
What does the cross attention layer allow (MT)
(in decoder)
Allows the decoder to attend to each of the source language tokens as projected into the final layer of the encoder
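A minimal single-head cross-attention sketch (numpy; the weight matrices and dimensions are illustrative assumptions): queries come from the decoder, keys and values from the encoder's final-layer outputs:

```python
import numpy as np

def cross_attention(dec_states, enc_outputs, Wq, Wk, Wv):
    """dec_states: (T_dec, d) decoder hidden states; enc_outputs: (T_src, d) final encoder layer."""
    Q = dec_states @ Wq                               # queries from the decoder
    K = enc_outputs @ Wk                              # keys from the encoder
    V = enc_outputs @ Wv                              # values from the encoder
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # (T_dec, T_src) attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the source tokens
    return weights @ V                                # each decoder position mixes source information
```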
What is BLEU
BiLingual Evaluation Understudy
- precision based metric that uses word overlap
- calculated for each translated sequence (averaged over a corpus to report overall performance)
Why is BLEU better than precision
Precision will give a perfect score to
reference: the cat is on the mat
hypothesis: the the the the the the
because the hypothesis does not contain any words that are not in the reference
How does BLEU work
For each word in the hypothesis
take the minimum of how many times it appears in the hypothesis and how many times it appears in the reference
sum all these minimums
divide by the total number of words in the hypothesis
eg Hypothesis = the the the the the the
reference = the cat is on the mat
BLEU = min(6,2)/6 = 2/6 = 1/3
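A small sketch of this clipped unigram count (whitespace tokenisation assumed):

```python
from collections import Counter

def bleu_unigram(hypothesis, reference):
    hyp = hypothesis.split()
    hyp_counts, ref_counts = Counter(hyp), Counter(reference.split())
    clipped = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    return clipped / len(hyp)

print(bleu_unigram("the the the the the the", "the cat is on the mat"))   # min(6,2)/6 = 1/3
```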
BLEU-N
based on n-grams
1) generate n-grams for each of the hyp and ref
2) for each hypothesis n-gram, take the minimum of (count in hyp, count in ref)
3) BLEU-N = (sum of these minimum values) / (number of n-grams in the hypothesis)
eg HYP = the cat is here
so bigrams = “the cat” “cat is” “is here”
NOTE only calculate over all hyp n-grams (not ref)
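A sketch of BLEU-N for a single n-gram order, following the card's definition (whitespace tokenisation assumed):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_n(hypothesis, reference, n=2):
    hyp_ngrams = ngrams(hypothesis.split(), n)
    hyp_counts = Counter(hyp_ngrams)
    ref_counts = Counter(ngrams(reference.split(), n))
    clipped = sum(min(count, ref_counts[g]) for g, count in hyp_counts.items())
    return clipped / len(hyp_ngrams)       # only hypothesis n-grams in the denominator

print(bleu_n("the cat is here", "the cat is on the mat", n=2))
# bigrams "the cat", "cat is", "is here": 2 of the 3 occur in the reference -> 2/3
```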
What is chrF
Character F-score
- based on a function of the number of character n-gram overlaps between a hypothesis and reference translation
- uses a parameter k = maximum length of character n-grams to be considered
eg k=2 means we are interested in unigrams and bigrams only
How to calculate chrP
the proportion of character 1-grams up to k-grams in the hypothesis that also occur in the reference, averaged over the n-gram lengths
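A sketch of chrP under this definition (character n-grams taken over the raw strings, counts clipped as in BLEU; both details are assumptions rather than the card's exact wording):

```python
from collections import Counter

def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def chrP(hypothesis, reference, k=2):
    """Average over n = 1..k of the fraction of hypothesis character n-grams found in the reference."""
    precisions = []
    for n in range(1, k + 1):
        hyp_counts = Counter(char_ngrams(hypothesis, n))
        ref_counts = Counter(char_ngrams(reference, n))
        matched = sum(min(count, ref_counts[g]) for g, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        precisions.append(matched / total if total else 0.0)
    return sum(precisions) / len(precisions)

print(round(chrP("the cat is here", "the cat is on the mat", k=2), 3))
```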