Week 3 - Transformer Models Flashcards
What is an encoder-decoder network
Also known as sequence-to-sequence networks
Generates output sequences of arbitrary length that are contextually appropriate
What are the 3 components of encoder-decoder networks
Encoder
Context vector c (a function of the encoder states h1:n)
Decoder
What does an encoder do in encoder-decoder networks
Takes an input sequence x1:n and generates a corresponding sequence of contextualised representations h1:n
(captures the information and dependencies between tokens)
What does a decoder do in encoder-decoder networks
Takes c (context vector) as input and generates a sequence of hidden states h1:m from which an output sequence y1:m can be generated
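Putting the three components into notation (a sketch based on the definitions above; f stands for whatever function maps the encoder states to the context vector):

```latex
h_{1:n} = \mathrm{Encoder}(x_{1:n})  % contextualised representations
c = f(h_{1:n})                        % context vector
h^d_{1:m} = \mathrm{Decoder}(c)       % decoder states, from which y_{1:m} is generated
```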
How can encoders and decoders be realised
By sequence-based architectures
e.g. RNNs
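A minimal sketch of an RNN-based encoder-decoder in PyTorch (layer sizes and names are illustrative, not from the cards):

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Minimal RNN sequence-to-sequence model (illustrative sizes)."""
    def __init__(self, vocab_size=1000, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, emb)
        self.tgt_emb = nn.Embedding(vocab_size, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src, tgt):
        # Encoder: contextualised hidden states h_1:n; keep only the last one
        _, h_n = self.encoder(self.src_emb(src))
        # Context vector c = h_n initialises the decoder (c = h^d_0)
        dec_states, _ = self.decoder(self.tgt_emb(tgt), h_n)
        return self.out(dec_states)   # logits over the vocabulary, one per step

model = EncoderDecoder()
src = torch.randint(0, 1000, (2, 7))   # batch of 2 source sequences, length 7
tgt = torch.randint(0, 1000, (2, 5))   # target-side inputs, length 5
logits = model(src, tgt)               # shape (2, 5, 1000)
```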
What is the purpose of the separator token <s>
Marks the end of the source text (and signals the decoder to begin generating the target)
In the decoder for translation, what is each word generation conditioned on
The previous hidden state and the embedding for the last word generated
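In equation form (ŷ_{t-1} is the embedding of the last word generated; g is the RNN update):

```latex
h^d_t = g(\hat{y}_{t-1},\, h^d_{t-1})
```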
How does the encoder work for machine translation (during inference time)
Takes the source text
The encoder network generates a hidden state for each token
Each hidden state is computed from the token's input embedding and the previous hidden state h_t-1
The encoder's outputs over the source are essentially ignored; only the hidden states are kept
How does the decoder work for machine translation (during inference time)
The decoder takes the last hidden state of the encoder and the first token (which will be <s>) to predict the first translated word
(repeats, feeding each predicted word back in, until an end-of-sequence token is produced)
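A sketch of greedy inference, building on the EncoderDecoder sketch above (sep_id and eos_id are assumed ids for <s> and the end-of-sequence token):

```python
import torch

def greedy_decode(model, src, sep_id=1, eos_id=2, max_len=20):
    """Greedy inference with the EncoderDecoder sketch above."""
    # Encoder pass: the per-token outputs are ignored, only the final
    # hidden state (the context vector) is kept
    _, h = model.encoder(model.src_emb(src))
    token = torch.tensor([[sep_id]])          # first decoder input is <s>
    result = []
    for _ in range(max_len):
        dec_out, h = model.decoder(model.tgt_emb(token), h)
        token = model.out(dec_out).argmax(-1) # most probable next word
        if token.item() == eos_id:            # stop at end-of-sequence
            break
        result.append(token.item())
    return result

print(greedy_decode(model, torch.randint(0, 1000, (1, 7))))
```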
What do h with superscripts e and d represent in transformer diagrams
Hidden states in the encoder and decoder respectively
How is the context vector formed and used in machine translation
The encoder's final hidden state becomes the context vector: c = h^e_n
The decoder uses c as its initial hidden state: h^d_0 = c
Which states in the decoder is c available to
It is available to all states
Otherwise the "context weakens" as the output sequence grows
To prevent this, c is added as a parameter in the computation of each h^d_t
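So the decoder update from before gains c as an extra argument at every step:

```latex
h^d_t = g(\hat{y}_{t-1},\, h^d_{t-1},\, c)
```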
What is context
All information from the source text
How does an RNN work for machine translation during training
The source text is processed in the encoder in the same way as at inference time
In the decoder, the separator token and the encoder's last hidden state are taken as input
Each decoder state is sent through a softmax to create a probability distribution over the vocabulary
Then the loss is calculated using cross-entropy (CE) loss between each generated distribution and the correct token (gold answer)
Average the loss across all states -> adjust weights accordingly
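A sketch of the training step, reusing model and src from the sketches above (gold stands for the reference translation; names are illustrative):

```python
import torch
import torch.nn.functional as F

sep_id = 1                                   # assumed id of the separator <s>
gold = torch.randint(0, 1000, (2, 5))        # gold target tokens (batch, m)

# Decoder inputs: <s> followed by the gold tokens shifted right
# (teacher forcing; see the next card)
dec_input = torch.cat(
    [torch.full((gold.size(0), 1), sep_id), gold[:, :-1]], dim=1)

logits = model(src, dec_input)               # (batch, m, vocab), pre-softmax
# cross_entropy applies the softmax and averages the loss across all states
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), gold.reshape(-1))
loss.backward()                              # then adjust weights (optimizer step)
```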
What is the difference between the inference and training stages in an RNN
Training uses teacher forcing
i.e. if an incorrect token is output, it will not be used as input for the succeeding states
Instead, the correct (gold) token is used
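The contrast in a sketch (one decoder step; at training time the gold token is fed in regardless of the model's prediction, at inference the prediction is fed back):

```python
import torch

def next_decoder_input(logits_t, gold_t, training):
    """Pick the next decoder input token for one step.
    logits_t: model scores for the current step, shape (batch, vocab)
    gold_t:   correct tokens for the current step, shape (batch,)"""
    if training:
        return gold_t                 # teacher forcing: ignore the prediction
    return logits_t.argmax(-1)        # inference: feed the prediction back in

# The model's (incorrect) top prediction is token 1, the gold token is 2
logits_t = torch.tensor([[0.1, 2.0, 0.3]])
gold_t = torch.tensor([2])
print(next_decoder_input(logits_t, gold_t, training=True))   # tensor([2])
print(next_decoder_input(logits_t, gold_t, training=False))  # tensor([1])
```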