DL-08 - Transformers Flashcards
DL-08 - Transformers
What are some problems with RNNs/LSTMs? (4)
- Difficult to train.
- Very long gradient paths.
- Transfer learning never really works.
- Recurrence prevents parallel computation.
DL-08 - Transformers
What is the name of the paper where transformers were introduced?
Attention is All You Need
DL-08 - Transformers
When was the transformers paper (Attention is All You Need) published?
2017
DL-08 - Transformers
Who were the authors of the transformers paper (Attention is All You Need)?
Vaswani et al.
DL-08 - Transformers
What do transformers use instead of recurrence? (2)
- Context windows (input more data at the same time)
- self-attention
DL-08 - Transformers
In what areas are transformers currently very good? (2)
- NLP
- Computer vision
DL-08 - Transformers
What does a transformer do? (2)
- Encodes an input sequence into a set of continuous representations (one vector per token)
- Decodes those representations back into an output sequence
DL-08 - Transformers
Do transformers use recurrence?
No, they avoid it.
DL-08 - Transformers
Why can encoders be so fast?
No recurrence -> parallel computation.
DL-08 - Transformers
What are the main characteristics of transformers? (3)
- non-sequential
- self-attention
- positional encoding
DL-08 - Transformers
Describe what is meant when we say transformers are non-sequential.
Sentences are processed as a whole, rather than word by word.
DL-08 - Transformers
“Sentences are processed as a whole, rather than word by word.”
What is this property called?
Non-sequential.
DL-08 - Transformers
Describe self-attention.
A new unit used to compute similarity scores between words in a sentence.
DL-08 - Transformers
Describe positional encoding.
Encodes information about the position of a token in a sentence.
DL-08 - Transformers
“Encodes information about the position of a token in a sentence.”
What is this called?
positional encoding
DL-08 - Transformers
“A new unit used to compute similarity scores between words in a sentence.”
What is this called?
Self-attention.
DL-08 - Transformers
What is the method transformers use to understand relevant words while processing a current word?
Self-attention.
DL-08 - Transformers
Why don’t transformers suffer from short-term memory?
Because they use self-attention mechanisms, allowing them to take the entire input sequence into account simultaneously.
DL-08 - Transformers
What parts does the “encoder embedding” consist of? (2)
- Word/input embedding
- Positional embedding
DL-08 - Transformers
What does positional embedding do?
It injects positional information (distance between different words) into the input embeddings.
DL-08 - Transformers
What functions does the “Attention is all you need” paper use for positional encoding?
Sin/cos
DL-08 - Transformers
In the image, what is
- d_model
- i
- pos
(See image)
- d_model: Embedding size (dimension of the embedding vectors)
- i: Index along the embedding dimension (each i selects one sine/cosine pair)
- pos: Position index of the token in the incoming sequence.
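Since the card only references an image for the formula, here it is for reference as given in “Attention is All You Need”, using the symbols defined above:

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```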
DL-08 - Transformers
How is positional information added to the embeddings?
The positional encodings are added element-wise to the input embeddings (simple addition).
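A minimal NumPy sketch of computing the sinusoidal encoding and adding it to the embeddings (the sequence length, embedding size and variable names are illustrative, not from the slides):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention is All You Need'."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1) token positions
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2) dimension pairs
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions use cosine
    return pe

# Toy example: 5 tokens, embedding size 8
embeddings = np.random.randn(5, 8)                        # word/input embeddings
encoder_input = embeddings + positional_encoding(5, 8)    # element-wise addition
```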
DL-08 - Transformers
What are the sub-modules of the encoder? (2)
- Multi-headed attention
- Fully connected feed forward network
DL-08 - Transformers
What does each of the sub-modules have (both the attention head and the FC module)? (2)
- Residual connections
- Normalization layer
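A minimal sketch of how each sub-module is wrapped, assuming the post-norm arrangement of the original paper (LayerNorm's learnable scale/shift parameters are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_block(x, sublayer):
    """Residual connection around a sub-module, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

# Encoder layer, schematically (the sub-module functions are placeholders):
# x = sublayer_block(x, multi_head_attention)
# x = sublayer_block(x, feed_forward)
```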
DL-08 - Transformers
What are the vectors in self-attention called? (3)
- Query
- Key
- Value
DL-08 - Transformers
How are the query, key and value vectors created?
A separate learned weight matrix for each (W_Q, W_K, W_V). Each vector is obtained by multiplying the incoming embedding vector with the corresponding weight matrix. (See image)
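A minimal sketch of that idea (the dimensions and random matrices are illustrative; in practice W_Q, W_K, W_V are learned):

```python
import numpy as np

d_model, d_k = 8, 8            # illustrative sizes
x = np.random.randn(d_model)   # embedding vector of one token

W_Q = np.random.randn(d_model, d_k)   # one separate weight matrix per vector type
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

q = x @ W_Q   # query vector
k = x @ W_K   # key vector
v = x @ W_V   # value vector
```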
DL-08 - Transformers
How is “score” calculated in self-attention?
By taking the dot product of the Q and K vectors.
DL-08 - Transformers
What do you get when you take the dot product of the Q and K vectors?
The “score”.
DL-08 - Transformers
How do you ensure stable gradients?
Scale down the “scores” by dividing them by √d_k (the square root of the key/query dimension).
DL-08 - Transformers
What is the formula for scaling the scores?
(See image)
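Since the answer only points to an image, the scaled score as used in the paper (the subscript notation here is mine) is:

```latex
\text{score}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}
```

where d_k is the dimension of the key (and query) vectors.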
DL-08 - Transformers
How do you normalize the scores?
Apply softmax to produce attention weights between 0 and 1 that sum to 1.
DL-08 - Transformers
How do you calculate the final attention weights?
Apply softmax to normalize the (scaled) scores, producing attention weights between 0 and 1.
DL-08 - Transformers
How do you get the output vector of a self-attention unit?
Calculate the attention weights, then multiply them by the value vectors and sum them up (a weighted sum of the values).
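A minimal sketch tying the last few cards together (scores, scaling, softmax and the weighted sum over the value vectors); shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_output(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # dot-product scores, scaled by sqrt(d_k)
    weights = softmax(scores)          # attention weights between 0 and 1
    return weights @ V                 # weighted sum of the value vectors

Q, K, V = (np.random.randn(5, 8) for _ in range(3))   # toy 5-token example
Z = self_attention_output(Q, K, V)
```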
DL-08 - Transformers
How do you write the self-attention block as a single matrix operation?
(See image)
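The single-matrix form referenced by the image is presumably the scaled dot-product attention from the paper:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```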
DL-08 - Transformers
What is depicted in the image? (See image)
Most of the self-attention block as a matrix operation.
DL-08 - Transformers
What is Multi-headed attention?
A block that runs N separate self-attention units (called heads), each with its own Q, K, V weight matrices, producing outputs Z_1, Z_2, Z_3, …, Z_N, which are concatenated and linearly projected to form the final output.
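A minimal NumPy sketch of the idea (head count, dimensions and the random weight matrices are illustrative; real implementations learn the weights):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, n_heads=4, d_k=16):
    """Run n_heads self-attention units, each with its own Q, K, V projections,
    then concatenate the head outputs Z_1..Z_N and project back to d_model."""
    d_model = X.shape[-1]
    heads = []
    for _ in range(n_heads):
        W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
        heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))   # Z_i for head i
    W_O = np.random.randn(n_heads * d_k, d_model)            # output projection
    return np.concatenate(heads, axis=-1) @ W_O

Z = multi_head_attention(np.random.randn(10, 64))   # 10 tokens, d_model = 64
```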