DL-08 - Transformers Flashcards
DL-08 - Transformers
What are some problems with RNNs/LSTMs? (4)
- Difficult to train.
- Very long gradient paths.
- Transfer learning never really works.
- Recurrence prevents parallel computation.
DL-08 - Transformers
What is the name of the paper where transformers were introduced?
Attention is All You Need
DL-08 - Transformers
When was the transformers paper (Attention is All You Need) published?
2017
DL-08 - Transformers
Who were the authors of the transformers paper (Attention is All You Need)?
Vaswani et al.
DL-08 - Transformers
What do transformers use instead of recurrence? (2)
- Context windows (input more data at the same time)
- self-attention
DL-08 - Transformers
In what areas are transformers currently very good? (2)
- NLP
- Computer vision
DL-08 - Transformers
What does a transformer do? (2)
- Encodes an input sequence into a set of continuous representations (one vector per token)
- Decodes those representations back into an output sequence
DL-08 - Transformers
Do transformers use recurrence?
No, they avoid it.
DL-08 - Transformers
Why can encoders be so fast?
No recurrence -> parallel computation.
DL-08 - Transformers
What are the main characteristics of transformers? (3)
- non-sequential
- self-attention
- positional encoding
DL-08 - Transformers
Describe what is meant when we say transformers are non-sequential.
Sentences are processed as a whole, rather than word by word.
DL-08 - Transformers
“Sentences are processed as a whole, rather than word by word.”
What is this property called?
Non-sequential.
DL-08 - Transformers
Describe self-attention.
A new unit used to compute similarity scores between words in a sentence.
DL-08 - Transformers
Describe positional encoding.
Encodes information about the position of a token in a sentence.
DL-08 - Transformers
“Encodes information about the position of a token in a sentence.”
What is this called?
positional encoding
DL-08 - Transformers
“A new unit used to compute similarity scores between words in a sentence.”
What is this called?
Self-attention.
DL-08 - Transformers
What is the method transformers use to understand relevant words while processing a current word?
Self-attention.
DL-08 - Transformers
Why don’t transformers suffer from short-term memory?
Because they use self-attention mechanisms, allowing them to take the entire input sequence into account simultaneously.
DL-08 - Transformers
What parts does the “encoder embedding” consist of? (2)
- Word/input embedding
- Positional embedding
DL-08 - Transformers
What does positional embedding do?
It injects positional information (distance between different words) into the input embeddings.
DL-08 - Transformers
What functions does the “Attention is all you need” paper use for positional encoding?
Sin/cos
DL-08 - Transformers
In the image, what is
- d_model
- i
- pos
(See image)
- d_model: Embedding size (dimension of the embedding vectors)
- i: Index along the embedding dimension (each i selects one sine/cosine pair)
- pos: Position index of the token in the incoming sequence.
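Since the card only references an image for the formula, here it is for reference as given in “Attention is All You Need”, using the symbols defined above:

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```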
DL-08 - Transformers
How is positional information added to the embeddings?
The positional encodings are added element-wise to the input embeddings (simple addition).
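A minimal NumPy sketch of computing the sinusoidal encoding and adding it to the embeddings (the sequence length, embedding size and variable names are illustrative, not from the slides):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention is All You Need'."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1) token positions
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2) dimension pairs
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions use cosine
    return pe

# Toy example: 5 tokens, embedding size 8
embeddings = np.random.randn(5, 8)                        # word/input embeddings
encoder_input = embeddings + positional_encoding(5, 8)    # element-wise addition
```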
DL-08 - Transformers
What are the sub-modules of the encoder? (2)
- Multi-headed attention
- Fully connected feed forward network
DL-08 - Transformers
What does each of the sub-modules have (both the attention head and the FC module)? (2)
- Residual connections
- Normalization layer
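A minimal sketch of how each sub-module is wrapped, assuming the post-norm arrangement of the original paper (LayerNorm's learnable scale/shift parameters are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_block(x, sublayer):
    """Residual connection around a sub-module, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

# Encoder layer, schematically (the sub-module functions are placeholders):
# x = sublayer_block(x, multi_head_attention)
# x = sublayer_block(x, feed_forward)
```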
DL-08 - Transformers
What are the vectors in self-attention called? (3)
- Query
- Key
- Value
DL-08 - Transformers
How are the query, key and value vectors created?
A separate learned weight matrix for each (W_Q, W_K, W_V). Each vector is obtained by multiplying the incoming embedding vector with the corresponding weight matrix. (See image)
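A minimal sketch of that idea (the dimensions and random matrices are illustrative; in practice W_Q, W_K, W_V are learned):

```python
import numpy as np

d_model, d_k = 8, 8            # illustrative sizes
x = np.random.randn(d_model)   # embedding vector of one token

W_Q = np.random.randn(d_model, d_k)   # one separate weight matrix per vector type
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

q = x @ W_Q   # query vector
k = x @ W_K   # key vector
v = x @ W_V   # value vector
```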
DL-08 - Transformers
How is “score” calculated in self-attention?
By taking the dot product of the Q and K vectors.
DL-08 - Transformers
What do you get when you take the dot product of the Q and K vectors?
The “score”.
DL-08 - Transformers
How do you ensure stable gradients?
Scale down the “scores” by dividing them by √d_k (the square root of the key/query dimension).
DL-08 - Transformers
What is the formula for scaling the scores?
(See image)
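Since the answer only points to an image, the scaled score as used in the paper (the subscript notation here is mine) is:

```latex
\text{score}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}
```

where d_k is the dimension of the key (and query) vectors.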
DL-08 - Transformers
How do you normalize the scores?
Apply softmax to produce attention weights between 0 and 1 that sum to 1.
DL-08 - Transformers
How do you calculate the final attention weights?
Apply softmax to normalize the (scaled) scores, producing attention weights between 0 and 1.
DL-08 - Transformers
How do you get the output vector of a self-attention unit?
Calculate the attention weights, then multiply them by the value vectors and sum them up (a weighted sum of the values).
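A minimal sketch tying the last few cards together (scores, scaling, softmax and the weighted sum over the value vectors); shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_output(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # dot-product scores, scaled by sqrt(d_k)
    weights = softmax(scores)          # attention weights between 0 and 1
    return weights @ V                 # weighted sum of the value vectors

Q, K, V = (np.random.randn(5, 8) for _ in range(3))   # toy 5-token example
Z = self_attention_output(Q, K, V)
```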
DL-08 - Transformers
How do you write the self-attention block as a single matrix operation?
(See image)
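The single-matrix form referenced by the image is presumably the scaled dot-product attention from the paper:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```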
DL-08 - Transformers
What is depicted in the image? (See image)
Most of the self-attention block as a matrix operation.
DL-08 - Transformers
What is Multi-headed attention?
A block that runs N separate self-attention units (called heads), each with its own Q, K, V weight matrices, producing outputs Z_1, Z_2, Z_3, …, Z_N, which are concatenated and linearly projected to form the final output.
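A minimal NumPy sketch of the idea (head count, dimensions and the random weight matrices are illustrative; real implementations learn the weights):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, n_heads=4, d_k=16):
    """Run n_heads self-attention units, each with its own Q, K, V projections,
    then concatenate the head outputs Z_1..Z_N and project back to d_model."""
    d_model = X.shape[-1]
    heads = []
    for _ in range(n_heads):
        W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
        heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))   # Z_i for head i
    W_O = np.random.randn(n_heads * d_k, d_model)            # output projection
    return np.concatenate(heads, axis=-1) @ W_O

Z = multi_head_attention(np.random.randn(10, 64))   # 10 tokens, d_model = 64
```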