Week 3 - Transformer Models Flashcards

1
Q

What is an encoder-decoder network?

A

Also known as sequence-to-sequence networks
They generate arbitrary-length output sequences that are contextually appropriate

2
Q

What are the 3 components of encoder-decoder networks?

A

Encoder
Context vector c (a function of the encoder's hidden states h1:n)
Decoder

3
Q

What does the encoder do in encoder-decoder networks?

A

Takes an input sequence x1:n and generates a corresponding sequence of contextualised representations h1:n
(captures the information and dependencies between tokens)

4
Q

What does the decoder do in encoder-decoder networks?

A

Takes c (context vector) as input and generates a sequence of hidden states h1:m from which an output sequence y1:m can be generated
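
A minimal sketch of the three components described in the cards above, assuming PyTorch and toy dimensions (SRC_VOCAB, TGT_VOCAB, EMB, HID and the class names are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

# Toy dimensions (illustrative only)
SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1000, 32, 64

class Encoder(nn.Module):
    """Maps an input sequence x_1:n to contextualised hidden states h_1:n."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, x):                 # x: (batch, n) of source token ids
        h_all, h_last = self.rnn(self.embed(x))
        return h_all, h_last               # h_1:n and the final state h_n

class Decoder(nn.Module):
    """Generates decoder hidden states (and output logits) one step at a time."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, y_prev, h_prev):     # one decoding step
        h_all, h_new = self.rnn(self.embed(y_prev), h_prev)
        return self.out(h_all), h_new      # logits over the vocab, new hidden state

# Usage sketch:
# h_all, c = Encoder()(x)        # context vector c = encoder's final hidden state
# logits, h1 = Decoder()(y0, c)  # y0: (batch, 1), starting with the <s> token
```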

5
Q

How can encoders and decoders be realised?

A

By sequence-based architectures
e.g. RNNs

6
Q

What is the purpose of the <s> separator token?

A

Marks the end of the source text

7
Q

In the decoder for translation, what is each word generation conditioned on?

A

The previous hidden state and the embedding for the last word generated

8
Q

How does the encoder work for machine translation (at inference time)?

A

Takes the source text
The encoder network generates the hidden states
Each step uses the input token embedding as well as the output from the previous hidden state (h_t-1)
The encoder's word-level outputs over the source are essentially ignored; only the hidden states are used

9
Q

How does the decoder work for machine translation (at inference time)?

A

The decoder takes the last hidden state of the encoder and the first token (which will be <s>) to predict the first translated word
This repeats: each subsequent word is predicted from the new hidden state and the word just generated
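
As a hedged illustration, a greedy-decoding loop at inference time might look like the sketch below (toy dimensions, a shared vocabulary and untrained single-layer GRUs stand in for the real encoder and decoder; SEP is an assumed id for the <s> token):

```python
import torch
import torch.nn as nn

# Illustrative sizes and modules (not the lecture's actual setup)
VOCAB, EMB, HID, SEP = 1000, 32, 64, 0
embed = nn.Embedding(VOCAB, EMB)
enc = nn.GRU(EMB, HID, batch_first=True)
dec = nn.GRU(EMB, HID, batch_first=True)
out = nn.Linear(HID, VOCAB)

src = torch.randint(0, VOCAB, (1, 7))   # toy source token ids

_, h = enc(embed(src))                  # encoder's last hidden state = context
prev = torch.tensor([[SEP]])            # first decoder input: the <s> separator

translation = []
for _ in range(10):                     # fixed toy length; a real decoder stops at an end token
    dec_out, h = dec(embed(prev), h)    # condition on previous hidden state + last word
    prev = out(dec_out).argmax(dim=-1)  # greedily pick the most probable next word
    translation.append(prev.item())
```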

10
Q

What do h^e and h^d (h with superscripts e and d) represent in the transformer diagrams?

A

Hidden states in the encoder and decoder, respectively

11
Q

How is the context vector formed and used in machine translation?

A

The encoder's final hidden state becomes the context vector: c = h^e_n
The decoder uses c as its initial hidden state: h^d_0 = c

12
Q

Which states in the decoder is c available to?

A

It is available to all states
(otherwise the “context weakens”)
c is added as a parameter in the computation of each h^d_i
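
A hedged way to write this in the deck's own notation (g is the decoder's recurrent update function and ŷ_{t-1} is the previously generated word):

h^d_t = g(ŷ_{t-1}, h^d_{t-1}, c)

so every decoder hidden state, not just the first, is computed with c as an input.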

13
Q

What is context?

A

All information from the source text

14
Q

How does an RNN encoder-decoder work for machine translation during training?

A

The source text is processed by the encoder in the same way as at inference
The decoder takes the separator token and the encoder's last hidden state
Each decoder output is sent through a softmax to create a probability distribution over the vocabulary
The loss is then calculated using cross-entropy (CE) between each generated distribution and the correct token (the gold answer)
The loss is averaged across all states and the weights are adjusted accordingly
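
A minimal sketch of one training step, assuming PyTorch, toy dimensions and a shared vocabulary (all names here are illustrative); it shows teacher forcing and the averaged cross-entropy loss described above:

```python
import torch
import torch.nn as nn

# Toy components standing in for the encoder and decoder
VOCAB, EMB, HID = 1000, 32, 64
embed = nn.Embedding(VOCAB, EMB)
enc = nn.GRU(EMB, HID, batch_first=True)
dec = nn.GRU(EMB, HID, batch_first=True)
out = nn.Linear(HID, VOCAB)
loss_fn = nn.CrossEntropyLoss()          # applies the softmax internally
opt = torch.optim.Adam(
    list(embed.parameters()) + list(enc.parameters())
    + list(dec.parameters()) + list(out.parameters()))

src = torch.randint(0, VOCAB, (1, 7))    # toy source token ids
gold = torch.randint(0, VOCAB, (1, 5))   # toy gold target token ids

# Encoder: process the source, keep only the final hidden state as c
_, c = enc(embed(src))

# Decoder with teacher forcing: the *gold* previous token is fed in at each
# step, regardless of what the model actually predicted
sep = torch.zeros((1, 1), dtype=torch.long)         # <s> separator (id 0 here)
dec_inputs = torch.cat([sep, gold[:, :-1]], dim=1)  # shift gold right by one
dec_states, _ = dec(embed(dec_inputs), c)
logits = out(dec_states)                            # (1, 5, VOCAB)

# Cross-entropy between each predicted distribution and the gold token,
# averaged over all decoder states; then adjust the weights
loss = loss_fn(logits.reshape(-1, VOCAB), gold.reshape(-1))
opt.zero_grad()
loss.backward()
opt.step()
```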

15
Q

What is the difference between the inference and training stages in an RNN encoder-decoder?

A

Training uses teacher forcing
i.e. if an incorrect token is output, it is not used as input for the succeeding states
Instead, the correct (gold) token is used

16
Q

What is the bottleneck?

A

The final encoder state h^e_n has to represent all information about the source text
It may not sufficiently represent all the necessary information
The decoder relies heavily on c

17
Q

What is the purpose of the attention mechanism?

A

Addresses the bottleneck
allows the decoder to obtain information from all hidden states in the encoder

18
Q

How is attention ‘weighted’?

A

Weights determine which parts of the source text are relevant for the current decoder token
This allows focusing on specific portions of the input sequence during decoding.

19
Q

What is different about c in the attention mechanism?

A

c is dynamically derived
It varies based on the decoder token being produced

20
Q

What is each hidden state h^d_i conditioned on when attention is used?

A

The prior hidden state
The previous token
The dynamically generated context vector c_i
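
A minimal NumPy sketch of how the attention weights and the dynamic context vector c_i could be computed (toy dimensions; the dot-product score used here is just one of several possible score functions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy encoder hidden states h^e_1..h^e_4 and the previous decoder
# hidden state h^d_{i-1}; dimensions are illustrative
rng = np.random.default_rng(0)
enc_states = rng.normal(size=(4, 8))   # h^e_j, j = 1..4
dec_state = rng.normal(size=(8,))      # h^d_{i-1}

# Scores: how relevant is each encoder state to the current decoder step?
scores = enc_states @ dec_state        # shape (4,)

# Attention weights: softmax over the scores
alpha = softmax(scores)                # sums to 1

# Dynamic context vector c_i: weighted sum of the encoder states
c_i = alpha @ enc_states               # shape (8,)
```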

21
Q

What are transformers?

A

A deep learning model that performs sequence processing without the need for recurrent connections

22
Q

What is the shortcoming of sequence-based architectures (e.g. RNNs)?

A

They cannot perform their calculations in parallel (each step depends on the previous hidden state)

23
Q

What is a causal (left-to-right) transformer?

A

For each input, the network has access to all inputs up to and including the current one
It has no access to input information beyond the current position
The computation for each item is done independently and can hence be parallelised
It makes use of self-attention: information in the context can be accessed without recurrent connections
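
A minimal NumPy sketch of causal self-attention, assuming a single head with no learned projections (toy dimensions); the point is the mask that blocks access to future positions:

```python
import numpy as np

def row_softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy input embeddings for a 5-token sequence
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))

scores = X @ X.T / np.sqrt(8)          # all-pairs attention scores, computed in parallel

# Causal (left-to-right) mask: position i may attend to positions j <= i only
mask = np.triu(np.ones((5, 5), dtype=bool), k=1)
scores[mask] = -1e9                    # block access to future positions

weights = row_softmax(scores)          # each row sums to 1 over the allowed positions
output = weights @ X                   # contextualised representations

# A bidirectional encoder would simply skip the masking step, letting
# self-attention range over the entire input
```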

24
Q

How does the causal model work as a language model (LM)?

A

The main difference is that there are no recurrent connections; they are replaced with a transformer block
For each input word, self-attention is calculated between that word and every word preceding it
This means we can parallelise

25
Q

What are bidirectional transformer encoders?

A

They allow the self-attention mechanism to range over the entire input
Not autoregressive
Parallelised
They map sequences of input embeddings to sequences of output embeddings which are contextualised using information from the entire input sequence
Contextualisation is done via self-attention

26
Q

What are 3 types of language models (LMs)?

A
Transformer-based
RNNs
n-grams

27
Q

Encoder-decoder networks encompass which models?

A

Transformer-based models (standard transformers, BERT, GPT)
RNNs (standard RNNs, LSTMs, GRUs)
CNNs

(The attention mechanism can be added to all of these but is not an encoder-decoder network itself)

28
Q

What is inference?

A

Testing, i.e. using the trained model to generate predictions without updating its weights

29
Q

Does the causal model for LM use teacher forcing?

A

Yes
If the model predicts the wrong word during training, the ground-truth word is used as input