Week 3 - Transformer Models Flashcards

1
Q

What is an encoder-decoder network?

A

Also known as sequence-to-sequence networks
They generate arbitrary-length output sequences that are contextually appropriate

2
Q

What are the 3 components of encoder-decoder networks?

A

Encoder
Context vector c (a function of the encoder's hidden states h1:n)
Decoder

3
Q

What does the encoder do in encoder-decoder networks?

A

Takes an input sequence x1:n and generates a corresponding sequence of contextualised representations h1:n
(captures the information and dependencies between tokens)

4
Q

What does the decoder do in encoder-decoder networks?

A

Takes c (context vector) as input and generates a sequence of hidden states h1:m from which an output sequence y1:m can be generated
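
A minimal sketch of the three components described in the cards above, assuming PyTorch and toy dimensions (SRC_VOCAB, TGT_VOCAB, EMB, HID and the class names are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

# Toy dimensions (illustrative only)
SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1000, 32, 64

class Encoder(nn.Module):
    """Maps an input sequence x_1:n to contextualised hidden states h_1:n."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, x):                 # x: (batch, n) of source token ids
        h_all, h_last = self.rnn(self.embed(x))
        return h_all, h_last               # h_1:n and the final state h_n

class Decoder(nn.Module):
    """Generates decoder hidden states (and output logits) one step at a time."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, y_prev, h_prev):     # one decoding step
        h_all, h_new = self.rnn(self.embed(y_prev), h_prev)
        return self.out(h_all), h_new      # logits over the vocab, new hidden state

# Usage sketch:
# h_all, c = Encoder()(x)        # context vector c = encoder's final hidden state
# logits, h1 = Decoder()(y0, c)  # y0: (batch, 1), starting with the <s> token
```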

5
Q

How can encoders and decoders be realised?

A

By sequence-based architectures
e.g. RNNs

6
Q

What is the purpose of the <s> separator token?

A

Marks the end of the source text

7
Q

In the decoder for translation, what is each word generation conditioned on?

A

The previous hidden state and the embedding for the last word generated

8
Q

How does the encoder work for machine translation (at inference time)?

A

Takes the source text
The encoder network generates the hidden states
Each step uses the input token embedding as well as the output from the previous hidden state (h_t-1)
The encoder's word-level outputs over the source are essentially ignored; only the hidden states are used

9
Q

How does the decoder work for machine translation (at inference time)?

A

The decoder takes the last hidden state of the encoder and the first token (which will be <s>) to predict the first translated word
This repeats: each subsequent word is predicted from the new hidden state and the word just generated
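
As a hedged illustration, a greedy-decoding loop at inference time might look like the sketch below (toy dimensions, a shared vocabulary and untrained single-layer GRUs stand in for the real encoder and decoder; SEP is an assumed id for the <s> token):

```python
import torch
import torch.nn as nn

# Illustrative sizes and modules (not the lecture's actual setup)
VOCAB, EMB, HID, SEP = 1000, 32, 64, 0
embed = nn.Embedding(VOCAB, EMB)
enc = nn.GRU(EMB, HID, batch_first=True)
dec = nn.GRU(EMB, HID, batch_first=True)
out = nn.Linear(HID, VOCAB)

src = torch.randint(0, VOCAB, (1, 7))   # toy source token ids

_, h = enc(embed(src))                  # encoder's last hidden state = context
prev = torch.tensor([[SEP]])            # first decoder input: the <s> separator

translation = []
for _ in range(10):                     # fixed toy length; a real decoder stops at an end token
    dec_out, h = dec(embed(prev), h)    # condition on previous hidden state + last word
    prev = out(dec_out).argmax(dim=-1)  # greedily pick the most probable next word
    translation.append(prev.item())
```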

10
Q

What do h^e and h^d (h with superscripts e and d) represent in the transformer diagrams?

A

Hidden states in the encoder and decoder, respectively

11
Q

How is the context vector formed and used in machine translation?

A

The encoder's final hidden state becomes the context vector: c = h^e_n
The decoder uses c as its initial hidden state: h^d_0 = c

12
Q

Which states in the decoder is c available to?

A

It is available to all states
(otherwise the “context weakens”)
c is added as a parameter in the computation of each h^d_i
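
A hedged way to write this in the deck's own notation (g is the decoder's recurrent update function and ŷ_{t-1} is the previously generated word):

h^d_t = g(ŷ_{t-1}, h^d_{t-1}, c)

so every decoder hidden state, not just the first, is computed with c as an input.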

13
Q

What is context?

A

All information from the source text

14
Q

How does an RNN encoder-decoder work for machine translation during training?

A

The source text is processed by the encoder in the same way as at inference
The decoder takes the separator token and the encoder's last hidden state
Each decoder output is sent through a softmax to create a probability distribution over the vocabulary
The loss is then calculated using cross-entropy (CE) between each generated distribution and the correct token (the gold answer)
The loss is averaged across all states and the weights are adjusted accordingly
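
A minimal sketch of one training step, assuming PyTorch, toy dimensions and a shared vocabulary (all names here are illustrative); it shows teacher forcing and the averaged cross-entropy loss described above:

```python
import torch
import torch.nn as nn

# Toy components standing in for the encoder and decoder
VOCAB, EMB, HID = 1000, 32, 64
embed = nn.Embedding(VOCAB, EMB)
enc = nn.GRU(EMB, HID, batch_first=True)
dec = nn.GRU(EMB, HID, batch_first=True)
out = nn.Linear(HID, VOCAB)
loss_fn = nn.CrossEntropyLoss()          # applies the softmax internally
opt = torch.optim.Adam(
    list(embed.parameters()) + list(enc.parameters())
    + list(dec.parameters()) + list(out.parameters()))

src = torch.randint(0, VOCAB, (1, 7))    # toy source token ids
gold = torch.randint(0, VOCAB, (1, 5))   # toy gold target token ids

# Encoder: process the source, keep only the final hidden state as c
_, c = enc(embed(src))

# Decoder with teacher forcing: the *gold* previous token is fed in at each
# step, regardless of what the model actually predicted
sep = torch.zeros((1, 1), dtype=torch.long)         # <s> separator (id 0 here)
dec_inputs = torch.cat([sep, gold[:, :-1]], dim=1)  # shift gold right by one
dec_states, _ = dec(embed(dec_inputs), c)
logits = out(dec_states)                            # (1, 5, VOCAB)

# Cross-entropy between each predicted distribution and the gold token,
# averaged over all decoder states; then adjust the weights
loss = loss_fn(logits.reshape(-1, VOCAB), gold.reshape(-1))
opt.zero_grad()
loss.backward()
opt.step()
```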

15
Q

What is the difference between the inference and training stages in an RNN encoder-decoder?

A

Training uses teacher forcing
i.e. if an incorrect token is output, it is not used as input for the succeeding states
Instead, the correct (gold) token is used

16
Q

What is the bottleneck?

A

The final encoder state h^e_n has to represent all information about the source text
It may not sufficiently represent all the necessary information
The decoder relies heavily on c

17
Q

What is the purpose of the attention mechanism?

A

Addresses the bottleneck
allows the decoder to obtain information from all hidden states in the encoder

18
Q

How is attention ‘weighted’?

A

Weights determine which parts of the source text are relevant for the current decoder token
This allows focusing on specific portions of the input sequence during decoding.

19
Q

What is different about c in the attention mechanism?

A

c is dynamically derived
It varies based on the decoder token being produced

20
Q

What is each hidden state h^d_i conditioned on when attention is used?

A

The prior hidden state
The previous token
The dynamically generated context vector c_i
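
A minimal NumPy sketch of how the attention weights and the dynamic context vector c_i could be computed (toy dimensions; the dot-product score used here is just one of several possible score functions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy encoder hidden states h^e_1..h^e_4 and the previous decoder
# hidden state h^d_{i-1}; dimensions are illustrative
rng = np.random.default_rng(0)
enc_states = rng.normal(size=(4, 8))   # h^e_j, j = 1..4
dec_state = rng.normal(size=(8,))      # h^d_{i-1}

# Scores: how relevant is each encoder state to the current decoder step?
scores = enc_states @ dec_state        # shape (4,)

# Attention weights: softmax over the scores
alpha = softmax(scores)                # sums to 1

# Dynamic context vector c_i: weighted sum of the encoder states
c_i = alpha @ enc_states               # shape (8,)
```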

21
Q

What are transformers?

A

A deep learning model that performs sequence processing without the need for recurrent connections

22
Q

What is the shortcoming of sequence-based architectures (e.g. RNNs)?

A

They cannot perform their calculations in parallel (each step depends on the previous hidden state)

23
Q

What is a causal (left-to-right) transformer?

A

For each input, the network has access to all inputs up to and including the current one
It has no access to input information beyond the current position
The computation for each item is done independently and can hence be parallelised
It makes use of self-attention: information in the context can be accessed without recurrent connections
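
A minimal NumPy sketch of causal self-attention, assuming a single head with no learned projections (toy dimensions); the point is the mask that blocks access to future positions:

```python
import numpy as np

def row_softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy input embeddings for a 5-token sequence
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))

scores = X @ X.T / np.sqrt(8)          # all-pairs attention scores, computed in parallel

# Causal (left-to-right) mask: position i may attend to positions j <= i only
mask = np.triu(np.ones((5, 5), dtype=bool), k=1)
scores[mask] = -1e9                    # block access to future positions

weights = row_softmax(scores)          # each row sums to 1 over the allowed positions
output = weights @ X                   # contextualised representations

# A bidirectional encoder would simply skip the masking step, letting
# self-attention range over the entire input
```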

24
Q

How does the causal model work as a language model (LM)?

A

The main difference is that there are no recurrent connections; they are replaced with a transformer block
For each input word, self-attention is calculated between that word and every word preceding it
This means we can parallelise

25
Q

What are bidirectional transformer encoders?

A

They allow the self-attention mechanism to range over the entire input
Not autoregressive
Parallelised
They map sequences of input embeddings to sequences of output embeddings which are contextualised using information from the entire input sequence
Contextualisation is done via self-attention

26
Q

What are 3 types of language models (LMs)?

A
Transformer-based
RNNs
n-grams

27
Q

Encoder-decoder networks encompass which models?

A

Transformer-based models (standard transformers, BERT, GPT)
RNNs (standard RNNs, LSTMs, GRUs)
CNNs

(The attention mechanism can be added to all of these but is not an encoder-decoder network itself)

28
Q

What is inference?

A

Testing, i.e. using the trained model to generate predictions without updating its weights

29
Q

Does the causal model for LM use teacher forcing?

A

Yes
If the model predicts the wrong word during training, the ground-truth word is used as input