DL-08 - Transformers Flashcards
DL-08 - Transformers
What are some problems with RNNs/LSTMs? (4)
- Difficult to train.
- Very long gradient paths.
- Transfer learning never really works.
- Recurrence goes against the principle of parallel computation.
DL-08 - Transformers
What is the name of the paper where transformers were introduced?
Attention is All You Need
DL-08 - Transformers
When was the transformers paper (Attention is All You Need) published?
2017
DL-08 - Transformers
Who were the authors of the transformers paper (Attention is All You Need)?
Vaswani et al.
DL-08 - Transformers
What do transformers use instead of recurrence? (2)
- Context windows (input more data at the same time)
- self-attention
DL-08 - Transformers
In what areas are transformers currently very good? (2)
- NLP
- Computer vision
DL-08 - Transformers
What does a transformer do? (2)
- Encodes an input into a single vector
- Decodes the vector back into output
DL-08 - Transformers
Do transformers use recurrence?
No, they avoid it.
DL-08 - Transformers
Why can encoders be so fast?
No recurrence -> parallel computation.
DL-08 - Transformers
What are the main characteristics of transformers? (3)
- non-sequential
- self-attention
- positional encoding
DL-08 - Transformers
Describe what is meant when we say transformers are non-sequential.
Sentences are processed as a whole, rather than word by word.
DL-08 - Transformers
“Sentences are processed as a whole, rather than word by word.”
What is this property called?
Non-sequential.
DL-08 - Transformers
Describe self-attention.
A new unit used to compute similarity scores between words in a sentence.
DL-08 - Transformers
Describe positional encoding.
Encodes information related to a position of a token in a sentence.
DL-08 - Transformers
“Encodes information related to a position of a token in a sentence.”
What is this called?
positional encoding
DL-08 - Transformers
“A new unit used to compute similarity scores between words in a sentence.”
What is this called?
Self-attention.
DL-08 - Transformers
What is the method transformers use to understand relevant words while processing a current word?
Self-attention.
DL-08 - Transformers
Why don’t transformers suffer from short-term memory?
Because they use self-attention mechanisms, allowing them to take the entire input sequence into account simultaneously.
DL-08 - Transformers
What parts does the “encoder embedding” consist of? (2)
- Word/input embedding
- Positional embedding
DL-08 - Transformers
What does positional embedding do?
It injects positional information (distance between different words) into the input embeddings.
DL-08 - Transformers
What functions does the “Attention is all you need” paper use for positional encoding?
Sin/cos
DL-08 - Transformers
In the image, what is
- d_model
- i
- pos
(See image)
- d_model: Embedding size
- i: Index along the embedding dimension
- pos: Position index in the incoming sequence.
DL-08 - Transformers
How is positional information added to the embeddings?
They’re added element-wise (Addition).
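As a supplement, a minimal NumPy sketch of the sinusoidal positional encoding from the paper, added element-wise to dummy embeddings (the function name and shapes are illustrative, not from the lecture):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # pos: position index in the sequence; i: index along the embedding dimension.
    pos = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    i = np.arange(d_model)[None, :]                        # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                   # even dimensions use sin
    pe[:, 1::2] = np.cos(angle[:, 1::2])                   # odd dimensions use cos
    return pe

# Positional information is added element-wise to the word/input embeddings.
embeddings = np.random.randn(10, 512)                      # dummy (seq_len, d_model) embeddings
x = embeddings + sinusoidal_positional_encoding(10, 512)
```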
DL-08 - Transformers
What are the sub-modules of the encoder? (2)
- Multi-headed attention
- Fully connected feed forward network
DL-08 - Transformers
What does each of the sub-modules have (both the attention head and the FC module)? (2)
- Residual connections
- Normalization layer
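A minimal NumPy sketch of how the two sub-modules are wrapped in residual connections and normalization; `self_attention` and `feed_forward` are stand-in callables and the layer norm is simplified (no learned parameters), so this is only the structural pattern, not the lecture's exact modules:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Simplified normalization layer: zero mean / unit variance per token, no learned scale or shift.
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def encoder_block(x, self_attention, feed_forward):
    # Sub-module 1: multi-headed attention + residual connection + normalization.
    x = layer_norm(x + self_attention(x))
    # Sub-module 2: fully connected feed-forward network + residual connection + normalization.
    x = layer_norm(x + feed_forward(x))
    return x

# Shape check with identity stand-ins for the two sub-modules:
out = encoder_block(np.random.randn(10, 512), lambda t: t, lambda t: t)
```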
DL-08 - Transformers
What are the vectors in self-attention called? (3)
- Query
- Key
- Value
DL-08 - Transformers
How are the query, key and value vectors created?
A separate weight matrix for each. Each vector is obtained by multiplying the incoming embedding vector with the corresponding weight matrix. (See image)
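A small NumPy sketch of the idea (the sizes 512 and 64 are the typical values from the paper, used here only as an example):

```python
import numpy as np

d_model, d_k = 512, 64                     # embedding size and per-head dimension
x = np.random.randn(10, d_model)           # incoming embedding vectors, shape (seq_len, d_model)

# One separate weight matrix per vector type.
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

# Each vector is simply a multiplication of the embedding with the corresponding matrix.
Q, K, V = x @ W_Q, x @ W_K, x @ W_V        # each of shape (seq_len, d_k)
```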
DL-08 - Transformers
How is “score” calculated in self-attention?
By taking the dot product of the Q and K vectors.
DL-08 - Transformers
What do you get when you take the dot product of the Q and K vectors?
The “score”.
DL-08 - Transformers
How do you ensure stable gradients?
Scale down the “scores”.
DL-08 - Transformers
What is the formula for scaling the scores?
(See image)
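The image is not reproduced here; for reference, the scaling used in the original paper divides the scores by the square root of the key dimension:

\text{scaled scores} = \frac{QK^{T}}{\sqrt{d_k}}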
DL-08 - Transformers
How do you normalize the scores?
Use softmax to produce attention weights with values between 0 and 1.
DL-08 - Transformers
How do you calculate the final attention weights?
Use softmax to normalize the scores, producing attention weights with values between 0 and 1.
DL-08 - Transformers
How do you get the output vector of a self-attention unit?
Calculate the attention weights, then multiply them by the value vectors.
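A minimal NumPy sketch of the full step sequence (score, scale, softmax, multiply by values); the function names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # subtract the max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T                 # 1. score: dot product of query and key vectors
    scores = scores / np.sqrt(d_k)   # 2. scale down the scores for more stable gradients
    weights = softmax(scores)        # 3. normalize into attention weights between 0 and 1
    return weights @ V               # 4. multiply by the value vectors -> output of the unit
```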
DL-08 - Transformers
How do you write the self-attention block as a single matrix operation?
(See image)
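The image is not shown here; for reference, the single matrix operation from the paper (scaled dot-product attention) is:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V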
DL-08 - Transformers
What is depicted in the image? (See image)
Most of the self-attention block as a matrix operation.
DL-08 - Transformers
What is Multi-headed attention?
A block that uses N different self-attention units (called heads), each with its own Q, K, V, to produce outputs Z_1, Z_2, Z_3, …, Z_n.
DL-08 - Transformers
What is it called when you use multiple self-attention blocks in the same layer?
Multi-headed attention
DL-08 - Transformers
What is a self-attention block called?
A head.
DL-08 - Transformers
What is a head?
One self-attention block.
DL-08 - Transformers
What does multi-headed attention do for the layer?
It allows the layer to have multiple representation subspaces.
DL-08 - Transformers
What is done to the outputs of the individual self-attention blocks, to make them outputs of a multi-headed attention block?
They’re concatenated into a single matrix and multiplied with a weight matrix W^O.
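A minimal NumPy sketch of multi-headed attention, reusing the `self_attention` function sketched above; the head count and weight shapes are illustrative, not taken from the slides:

```python
import numpy as np

def multi_head_attention(x, heads, W_O):
    # Each head is its own self-attention block with its own W_Q, W_K, W_V.
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = x @ W_Q, x @ W_K, x @ W_V
        outputs.append(self_attention(Q, K, V))   # Z_1, Z_2, ..., Z_n
    Z = np.concatenate(outputs, axis=-1)          # concatenate into a single matrix
    return Z @ W_O                                # project with the output weight matrix W^O

# Example: 8 heads, d_model = 512, d_k = 64 (typical paper values).
heads = [tuple(np.random.randn(512, 64) for _ in range(3)) for _ in range(8)]
W_O = np.random.randn(8 * 64, 512)
z = multi_head_attention(np.random.randn(10, 512), heads, W_O)   # (10, 512)
```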
DL-08 - Transformers
What is in the image? (See image)
A transformer encoder block.
DL-08 - Transformers
Label the masked parts of the image. (See image)
(See image)
DL-08 - Transformers
What is in the image? (See image)
A transformer decoder block
DL-08 - Transformers
Label the masked parts of the image. (See image)
(See image)
DL-08 - Transformers
What happens to the outputs of a transformer decoder block?
(See image)
DL-08 - Transformers
What sub-layers does the decoder have? (3)
- 2 multi-headed attention layers,
- a feed-forward layer,
- residual connections and normalization layers after each sub-layer
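A structural sketch of the three sub-layers in NumPy, with stand-in callables for the attention and feed-forward modules and the same simplified layer norm as in the encoder sketch above:

```python
import numpy as np

def layer_norm(x, eps=1e-5):   # same simplified helper as in the encoder sketch
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def decoder_block(x, enc_out, masked_self_attention, cross_attention, feed_forward):
    # Sub-layer 1: masked (look-ahead) multi-headed self-attention over the decoder embeddings.
    x = layer_norm(x + masked_self_attention(x))
    # Sub-layer 2: encoder-decoder attention; queries come from the previous sub-layer,
    # keys and values come from the encoder output.
    x = layer_norm(x + cross_attention(x, enc_out))
    # Sub-layer 3: fully connected feed-forward network.
    x = layer_norm(x + feed_forward(x))
    return x
```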
DL-08 - Transformers
What is the decoder embedding comprised of? (2)
- Output word embedding
- Positional embedding
DL-08 - Transformers
What is fed into the first multi-head attention layer in a Transformer decoder?
The decoder embedding (output word embedding plus positional embedding).
DL-08 - Transformers
In the transformer’s decoder, how is the first attention head different from the encoder’s attention head?
It uses a look-ahead mask.
DL-08 - Transformers
In sequence models, what is the purpose of a look-ahead mask used in a decoder with multi-head attention?
To prevent the decoder from conditioning on (attending to) future tokens.
DL-08 - Transformers
How do you create a look-ahead mask?
(See image)
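The image is not reproduced; a minimal NumPy sketch of one common way to build and apply such a mask (set the blocked positions' scores to -inf before the softmax):

```python
import numpy as np

def look_ahead_mask(size):
    # Upper-triangular boolean matrix: position i may not attend to any position j > i.
    return np.triu(np.ones((size, size)), k=1).astype(bool)

# Applied to the scaled scores, before the softmax: masked entries get ~0 attention weight.
scores = np.random.randn(4, 4)
scores[look_ahead_mask(4)] = -np.inf
```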
DL-08 - Transformers
Where is the mask applied in a decoder?
(See image)
DL-08 - Transformers
What are the inputs to the decoder’s 2nd attention head? (2)
- Keys and values from the encoder output
- Queries from the 1st (masked) attention head
DL-08 - Transformers
What is another way to think of the 2nd attention head in the decoder?
Encoder-decoder attention
DL-08 - Transformers
What happens to the output of the decoder block?
It’s sent through a linear classifier, then a softmax activation.
(See image)
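A minimal NumPy sketch of this output stage; the vocabulary size and weight names are illustrative:

```python
import numpy as np

def output_layer(decoder_out, W_vocab, b_vocab):
    logits = decoder_out @ W_vocab + b_vocab                 # linear classifier over the vocabulary
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # softmax activation
    return e / e.sum(axis=-1, keepdims=True)                 # probability distribution per position

probs = output_layer(np.random.randn(10, 512),               # decoder block output (seq_len, d_model)
                     np.random.randn(512, 1000),             # hypothetical vocabulary of 1000 words
                     np.zeros(1000))
next_word_id = probs[-1].argmax()                            # predict the most probable next word
```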
DL-08 - Transformers
How do we interpret the output of a transformer?
It’s a probability distribution over the words in your vocabulary.
(We try to predict the next word.)
DL-08 - Transformers
What is a stacked encoder/decoder?
Adding multiple layers of encoders/decoders to improve performance. (See image)
DL-08 - Transformers
What are some popular transformer models mentioned in the lecture? (5)
- BERT
- OpenAI’s GPT family
- Google Bard
- XLNet
- T5
DL-08 - Transformers
What is BERT short for?
Bidirectional Encoder Representations from Transformers
DL-08 - Transformers
What is GPT short for?
Generative Pretrained Transformer
DL-08 - Transformers
What is T5 short for? (TTTTT)
Text-To-Text Transfer Transformer
DL-08 - Transformers
When was BERT released?
2018
DL-08 - Transformers
When was the first GPT released?
2018
DL-08 - Transformers
When was XLNet released?
2019
DL-08 - Transformers
When was T5 released?
2020
DL-08 - Transformers
What are the two novel techniques used by BERT? (2)
- Masked Language Model (MLM)
- Next Sentence Prediction (NSP)
DL-08 - Transformers
What is MLM short for?
Masked Language Model
DL-08 - Transformers
What is NSP short for?
Next Sentence Prediction
DL-08 - Transformers
What does BERT use to better determine context?
Bidirectional context (it looks at tokens both to the left and to the right).
DL-08 - Transformers
What are some tasks where BERT is useful? (3)
- Classification
- Fill in the blanks
- Question answering
DL-08 - Transformers
What variants of BERT are mentioned in the lecture slides? (4)
- RoBERTa
- ALBERT
- StructBERT
- DeBERTa
DL-08 - Transformers
What’s special about RoBERTa?
A Robustly Optimized BERT Pretraining Approach
DL-08 - Transformers
What’s special about ALBERT?
A Lite BERT for Self-supervised Learning of Language Representations
DL-08 - Transformers
What’s special about StructBERT?
Incorporating Language Structures into Pre-training for Deep Language Understanding
DL-08 - Transformers
What objective was GPT trained with?
Predicting the next word in a sequence.
DL-08 - Transformers
How are GPT models trained?
They are fine-tuned using RLHF (Reinforcement Learning from Human Feedback).
DL-08 - Transformers
What is RLHF short for?
Reinforcement learning from human feedback
DL-08 - Transformers
How many layers does GPT-3 have?
96 layers
DL-08 - Transformers
How many attention heads per layer does GPT-3 have?
96 attention heads
DL-08 - Transformers
What is a vision transformer?
A transformer architecture applied to computer vision tasks.
DL-08 - Transformers
What is ViT short for?
Vision transformer
DL-08 - Transformers
Who first published vision transformers for ImageNet?
Dosovitskiy et al. from Google Brain
DL-08 - Transformers
When were vision transformers first published?
2020
DL-08 - Transformers
What is the architecture of vision transformers?
(See image)
DL-08 - Transformers
How are images preprocessed for use in a vision transformer?
The input image is split into small patches, e.g. 16x16 pixels.
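A minimal NumPy sketch of the patching step (patch size and image size are just example values):

```python
import numpy as np

def image_to_patches(img, patch=16):
    # Split an (H, W, C) image into non-overlapping patch x patch blocks and flatten each block,
    # so the transformer receives a sequence of patch vectors instead of a sequence of words.
    H, W, C = img.shape
    img = img[: H - H % patch, : W - W % patch]                 # crop so H and W divide evenly
    rows, cols = img.shape[0] // patch, img.shape[1] // patch
    blocks = img.reshape(rows, patch, cols, patch, C).swapaxes(1, 2)
    return blocks.reshape(rows * cols, patch * patch * C)       # (num_patches, patch_dim)

patches = image_to_patches(np.random.rand(224, 224, 3))         # 196 patches of length 768
```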
DL-08 - Transformers
Label the masked parts of the image.
(See image)
DL-08 - Transformers
Label the masked parts of the image.
(See image)
DL-08 - Transformers
Label the masked parts of the image.
(See image)
DL-08 - Transformers
Label the masked parts of the image.
(See image)
DL-08 - Transformers
Label the masked parts of the image.
(See image)
DL-08 - Transformers
Label the masked parts of the image.
(See image)
DL-08 - Transformers
What’s depicted in the image?
A ViT (Vision transformer).