DL-08 - Transformers Flashcards

1
Q

DL-08 - Transformers

What are some problems with RNNs/LSTMs? (4)

A
  • Difficult to train.
  • Very long gradient paths.
  • Transfer learning never really works.
  • Recurrence prevents parallel computation.
2
Q

DL-08 - Transformers

What is the name of the paper where transformers were introduced?

A

Attention is All You Need

3
Q

DL-08 - Transformers

When was the transformers paper (Attention is All You Need) published?

A

2017

4
Q

DL-08 - Transformers

Who were the authors of the transformers paper (Attention is All You Need)

A

Vaswani et al.

5
Q

DL-08 - Transformers

What do transformers use instead of recurrence? (2)

A
  • Context windows (more of the input is fed in at the same time)
  • Self-attention
6
Q

DL-08 - Transformers

In what areas are transformers currently very good? (2)

A
  • NLP
  • Computer vision
7
Q

DL-08 - Transformers

What does a transformer do? (2)

A
  • Encodes an input sequence into vector representations
  • Decodes those representations back into an output sequence
8
Q

DL-08 - Transformers

Do transformers use recurrence?

A

No, they avoid it.

9
Q

DL-08 - Transformers

Why can encoders be so fast?

A

No recurrence -> computation can be parallelized.

10
Q

DL-08 - Transformers

What are the main characteristics of transformers? (3)

A
  • non-sequential
  • self-attention
  • positional encoding
11
Q

DL-08 - Transformers

Describe what is meant when we say transformers are non-sequential.

A

Sentences are processed as a whole, rather than word by word.

12
Q

DL-08 - Transformers

“Sentences are processed as a whole, rather than word by word.”
What is this property called?

A

Non-sequential.

13
Q

DL-08 - Transformers

Describe self-attention.

A

A new unit used to compute similarity scores between words in a sentence.

14
Q

DL-08 - Transformers

Describe positional encoding.

A

Encodes information related to a position of a token in a sentence.

15
Q

DL-08 - Transformers

“Encodes information related to a position of a token in a sentence.”
What is this called?

A

positional encoding

16
Q

DL-08 - Transformers

“A new unit used to compute similarity scores between words in a sentence.”
What is this called?

A

Self-attention.

17
Q

DL-08 - Transformers

What is the method transformers use to understand relevant words while processing the current word?

A

Self-attention.

18
Q

DL-08 - Transformers

Why don’t transformers suffer from short-term memory?

A

Because they use self-attention mechanisms, allowing them to take the entire input sequence into account simultaneously.

19
Q

DL-08 - Transformers

What parts does “encoder embedding” consist of?

A
  • Word/input embedding
  • Positional embedding
20
Q

DL-08 - Transformers

What does positional embedding do?

A

It injects positional information (distance between different words) into the input embeddings.

21
Q

DL-08 - Transformers

What functions does the “Attention is all you need” paper use for positional encoding?

A

Sin/cos

22
Q

DL-08 - Transformers

In the image, what is
- d_model
- i
- pos

(See image)

A
  • d_model: The embedding size.
  • i: The index along the embedding dimension (each value of i gives one sine/cosine pair).
  • pos: The position index in the incoming sequence. (The formulas are sketched below.)
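The referenced image is not included here; for reference, these are the sinusoidal positional-encoding formulas from "Attention is All You Need", using the pos, i and d_model defined above:

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{\,2i/d_{\mathrm{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{\mathrm{model}}}}\right)
```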
23
Q

DL-08 - Transformers

How is positional information added to the embeddings?

A

The positional encodings are added to the embeddings element-wise (addition).
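A minimal NumPy sketch of these two steps (computing the sinusoidal encodings and adding them element-wise to the input embeddings); the array sizes and names are illustrative, not from the lecture:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as in 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1) token positions
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2) dimension-pair index
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

# Positional information is injected by element-wise addition:
embeddings = np.random.randn(10, 512)            # (seq_len, d_model) word embeddings
encoder_input = embeddings + positional_encoding(10, 512)
```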

24
Q

DL-08 - Transformers

What are the sub-modules of the encoder? (2)

A
  • Multi-headed attention
  • Fully connected feed forward network
25
Q

DL-08 - Transformers

What does each of the sub-modules (both the attention head and the FC module) have? (2)

A
  • Residual connections
  • Normalization layer
26
Q

DL-08 - Transformers

What are the vectors in self-attention called? (3)

A
  • Query
  • Key
  • Value
27
Q

DL-08 - Transformers

How are the query, key and value vectors created?

A

There is a separate weight matrix for each. Each vector is obtained by multiplying the incoming embedding vector with the corresponding weight matrix. (See image)
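A minimal sketch, assuming one incoming embedding vector x and illustrative (randomly initialised) weight matrices W_q, W_k, W_v; in a real model these are learned parameters:

```python
import numpy as np

d_model, d_k = 512, 64                   # embedding size and Q/K/V size (illustrative)
W_q = np.random.randn(d_model, d_k)      # separate weight matrix per vector type
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)

x = np.random.randn(d_model)             # incoming embedding vector for one token
q, k, v = x @ W_q, x @ W_k, x @ W_v      # query, key and value vectors
```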

28
Q

DL-08 - Transformers

How is “score” calculated in self-attention?

A

By taking the dot product of the Q and K vectors.

29
Q

DL-08 - Transformers

What do you get when you take the dot product of the Q and K vectors?

A

The “score”.

30
Q

DL-08 - Transformers

How do you ensure stable gradients?

A

Scale down the “scores” (divide them by √d_k).

31
Q

DL-08 - Transformers

What is the formula for scaling the scores?

A

The dot-product scores are divided by √d_k (the square root of the key dimension). (See image)

32
Q

DL-08 - Transformers

How do you normalize the scores?

A

Apply softmax to produce attention weights between 0 and 1 (summing to 1).

33
Q

DL-08 - Transformers

How do you calculate the final attention weights?

A

Apply softmax to normalize the scaled scores, producing attention weights between 0 and 1.

34
Q

DL-08 - Transformers

How do you get the output vector of a self-attention unit?

A

Calculate the attention weights, then multiply them by the value vectors (and sum the results).

35
Q

DL-08 - Transformers

How do you write the self-attention block as a single matrix operation?

A

Attention(Q, K, V) = softmax(QK^T / √d_k) · V. (See image)
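A minimal NumPy sketch of the whole block as a single matrix operation, covering the score, scaling, softmax and value-weighting steps from the previous cards (shapes are illustrative):

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # dot-product scores, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax -> attention weights
    return weights @ V                                 # weighted sum of value vectors

Q = np.random.randn(10, 64)                            # (seq_len, d_k), illustrative
K = np.random.randn(10, 64)
V = np.random.randn(10, 64)
Z = self_attention(Q, K, V)                            # (seq_len, d_k) output vectors
```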

36
Q

DL-08 - Transformers

What is depicted in the image? (See image)

A

Most of the self-attention block as a matrix operation.

37
Q

DL-08 - Transformers

What is Multi-headed attention?

A

A block that uses N different self-attention units (called heads), each with its own Q, K, V, to produce outputs Z_1, Z_2, …, Z_N.

38
Q

DL-08 - Transformers

What is it called when you use multiple self-attention blocks in the same layer?

A

Multi-headed attention

39
Q

DL-08 - Transformers

What is a self-attention block called?

A

A head.

40
Q

DL-08 - Transformers

What is a head?

A

One self-attention block.

41
Q

DL-08 - Transformers

What does multi-headed attention do for the layer?

A

It allows the layer to have multiple representation subspaces.

42
Q

DL-08 - Transformers

What is done to the outputs of the individual self-attention blocks, to make them outputs of a multi-headed attention block?

A

They’re concatenated into a single matrix and multiplied with a weight matrix W^O.
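A minimal sketch of this step, assuming the per-head outputs Z_1 … Z_n already exist; the sizes and the random W^O are illustrative (W^O is learned in practice):

```python
import numpy as np

seq_len, d_k, n_heads, d_model = 10, 64, 8, 512
heads = [np.random.randn(seq_len, d_k) for _ in range(n_heads)]   # Z_1 ... Z_n
W_O = np.random.randn(n_heads * d_k, d_model)                     # output projection

Z = np.concatenate(heads, axis=-1)        # concatenate: (seq_len, n_heads * d_k)
multi_head_output = Z @ W_O               # project back to (seq_len, d_model)
```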

43
Q

DL-08 - Transformers

What is in the image? (See image)

A

A transformer encoder block.

44
Q

DL-08 - Transformers

Label the masked parts of the image. (See image)

A

(See image)

45
Q

DL-08 - Transformers

What is in the image? (See image)

A

A transformer decoder block

46
Q

DL-08 - Transformers

Label the masked parts of the image. (See image)

A

(See image)

47
Q

DL-08 - Transformers

What happens to the outputs of a transformer decoder block?

A

(See image)

48
Q

DL-08 - Transformers

What sub-layers does the decoder have? (3)

A
  • 2 multi-headed attention layers,
  • a feed-forward layer,
  • residual connections and normalization layers after each sub-layer
49
Q

DL-08 - Transformers

What are decoder embeddings comprised of? (2)

A
  • Output word embedding
  • Positional embedding
50
Q

DL-08 - Transformers

What is fed into the first multi-head attention layer in a Transformer decoder?

A

The output of the Transformer decoder embedding.

51
Q

DL-08 - Transformers

In the transformer’s decoder, how is the first attention head different from the encoder’s attention head?

A

It uses a look-ahead mask.

52
Q

DL-08 - Transformers

In sequence models, what is the purpose of a look-ahead mask used in a decoder with multi-head attention?

A

To prevent the decoder from conditioning on future tokens.

53
Q

DL-08 - Transformers

How do you create a look-ahead mask?

A

(See image)
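The referenced image is not included here; a common construction (a sketch, not necessarily the exact one from the slides) is an upper-triangular matrix of -inf that is added to the scores before the softmax, so each position gets ~zero weight on later positions:

```python
import numpy as np

seq_len = 5
# -inf above the diagonal blocks attention to future positions; 0 elsewhere.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Applied inside the masked attention head (scores here are illustrative):
scores = np.random.randn(seq_len, seq_len)
masked_scores = scores + mask             # softmax(masked_scores) ignores the future
```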

54
Q

DL-08 - Transformers

Where is the mask applied in a decoder?

A

Inside the decoder’s first multi-head attention block: the mask is added to the scaled scores just before the softmax. (See image)

55
Q

DL-08 - Transformers

What are the inputs to the decoder’s 2nd attention head? (2)

A
  • Keys and values from the encoder output
  • Queries from the 1st (masked) attention head (see the sketch below)
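A minimal sketch of this encoder-decoder (“cross”) attention step; the shapes and the randomly initialised projection matrices are illustrative:

```python
import numpy as np

d_model, d_k = 512, 64
enc_out = np.random.randn(10, d_model)        # encoder output (10 source tokens)
dec_hidden = np.random.randn(7, d_model)      # output of the decoder's 1st (masked) head

W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
Q = dec_hidden @ W_q                          # queries come from the decoder side
K = enc_out @ W_k                             # keys come from the encoder output
V = enc_out @ W_v                             # values come from the encoder output

scores = Q @ K.T / np.sqrt(d_k)               # (7, 10): each target position attends over the source
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
Z = weights @ V                               # (7, d_k) cross-attention output
```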
56
Q

DL-08 - Transformers

What is another way to think of the 2nd attention head in the decoder?

A

Encoder-decoder attention

57
Q

DL-08 - Transformers

What happens to the output of the decoder block?

A

It’s sent through a linear classifier, then a softmax activation.

(See image)
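A minimal sketch of this final step, with an illustrative vocabulary size and randomly initialised classifier weights:

```python
import numpy as np

d_model, vocab_size = 512, 30000                  # illustrative sizes
decoder_output = np.random.randn(d_model)         # decoder output for the current position
W_vocab = np.random.randn(d_model, vocab_size)    # linear classifier weights (learned)

logits = decoder_output @ W_vocab                 # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # softmax: distribution over next words
next_word_id = probs.argmax()                     # e.g. greedy decoding picks the top word
```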

58
Q

DL-08 - Transformers

How do we interpret the output of a transformer?

A

It’s a probability distribution over the words in your vocabulary.

(We try to predict the next word.)

59
Q

DL-08 - Transformers

What is a stacked encoder/decoder?

A

Adding multiple layers of encoders/decoders to improve performance. (See image)
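A minimal structural sketch of stacking, assuming hypothetical encoder_block / decoder_block callables (one full layer each; the original paper stacks N = 6 of each):

```python
def stacked_encoder(x, encoder_blocks):
    # Each block = multi-head attention + feed-forward (with residuals and normalization).
    for block in encoder_blocks:
        x = block(x)                      # output of one encoder layer feeds the next
    return x

def stacked_decoder(y, enc_out, decoder_blocks):
    # Decoder blocks additionally attend to the (final) encoder output.
    for block in decoder_blocks:
        y = block(y, enc_out)
    return y
```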

60
Q

DL-08 - Transformers

What are some popular transformers mentioned in the lecture? (5)

A
  • BERT
  • OpenAI’s GPT family
  • Google Bard
  • XLNet
  • T5
61
Q

DL-08 - Transformers

What is BERT short for?

A

Bidirectional Encoder Representations from Transformers

62
Q

DL-08 - Transformers

What is GPT short for?

A

Generative Pretrained Transformer

63
Q

DL-08 - Transformers

What is T5 short for? (TTTTT)

A

Text-To-Text Transfer Transformer

64
Q

DL-08 - Transformers

When was BERT released?

A

2018

65
Q

DL-08 - Transformers

When was the first GPT released?

A

2018

66
Q

DL-08 - Transformers

When was XLNet released?

A

2019

67
Q

DL-08 - Transformers

When was T5 released?

A

2020

68
Q

DL-08 - Transformers

What are the two novel techniques used by BERT? (2)

A
  • Masked Language Model (MLM)
  • Next Sentence Prediction (NSP)
69
Q

DL-08 - Transformers

What is MLM short for?

A

Masked Language Model

70
Q

DL-08 - Transformers

What is NSP short for?

A

Next Sentence Prediction

71
Q

DL-08 - Transformers

What does BERT use to better determine context?

A

Bidirectionality (it uses both left and right context).

72
Q

DL-08 - Transformers

What are some tasks where BERT is useful? (3)

A
  • Classification
  • Fill in the blanks
  • Question answering
73
Q

DL-08 - Transformers

What variants of BERT are mentioned in the lecture slides? (4)

A
  • RoBERTa
  • ALBERT
  • StructBERT
  • DeBERTa
74
Q

DL-08 - Transformers

What’s special about RoBERTa?

A

A Robustly Optimized BERT Pretraining Approach

75
Q

DL-08 - Transformers

What’s special about ALBERT?

A

A Lite BERT for Self-supervised Learning of Language Representations

76
Q

DL-08 - Transformers

What’s special about StructBERT?

A

Incorporating Language Structures into Pre-training for Deep Language Understanding

77
Q

DL-08 - Transformers

What objective was GPT trained with?

A

Predicting the next word in a sequence.

78
Q

DL-08 - Transformers

How are GPTs trained?

A

Using RLHF (Reinforcement Learning from Human Feedback) during fine-tuning.

79
Q

DL-08 - Transformers

What is RLHF short for?

A

Reinforcement learning from human feedback

80
Q

DL-08 - Transformers

How many layers does GPT-3 have?

A

96 layers

81
Q

DL-08 - Transformers

How many attention heads per layer does GPT-3 have?

A

96 attention heads

82
Q

DL-08 - Transformers

What is a vision transformer?

A

A transformer architecture applied to computer vision (image) tasks.

83
Q

DL-08 - Transformers

What is ViT short for?

A

Vision transformer

84
Q

DL-08 - Transformers

Who first published vision transformers for ImageNet?

A

Dosovitskiy et al. from Google Brain

85
Q

DL-08 - Transformers

When were vision transformers first published?

A

2020

86
Q

DL-08 - Transformers

What is the architecture of vision transformers?

A

(See image)

87
Q

DL-08 - Transformers

How are images preprocessed for use in a vision transformer?

A

The image is split into small patches, e.g. 16x16 pixels.
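A minimal NumPy sketch of that preprocessing step, assuming a square RGB image whose side is divisible by the patch size (sizes are illustrative): the image is cut into 16x16 patches and each patch is flattened, ready to be linearly embedded as a token:

```python
import numpy as np

def image_to_patches(img, patch=16):
    """Split an (H, W, C) image into flattened (patch*patch*C) patch vectors."""
    H, W, C = img.shape
    img = img.reshape(H // patch, patch, W // patch, patch, C)
    img = img.transpose(0, 2, 1, 3, 4)            # group the patch grid together
    return img.reshape(-1, patch * patch * C)     # (num_patches, patch*patch*C)

img = np.random.rand(224, 224, 3)                 # illustrative 224x224 RGB image
patches = image_to_patches(img)                   # (196, 768): 14x14 patches of 16x16x3
```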

88
Q

DL-08 - Transformers

Label the masked parts of the image.

A

(See image)

89
Q

DL-08 - Transformers

Label the masked parts of the image.

A

(See image)

90
Q

DL-08 - Transformers

Label the masked parts of the image.

A

(See image)

91
Q

DL-08 - Transformers

Label the masked parts of the image.

A

(See image)

92
Q

DL-08 - Transformers

Label the masked parts of the image.

A

(See image)

93
Q

DL-08 - Transformers

Label the masked parts of the image.

A

(See image)

94
Q

DL-08 - Transformers

What’s depicted in the image?

A

A ViT (Vision transformer).