ML | Transformers | Basics | Priority Flashcards
Explain the intuition of attention using a time series.
attention self-attention transformers
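(Sketch, not from source) One common way to formalize the time-series intuition, assuming a kernel-regression framing that the source may present differently: to predict at time t, take a weighted average of past observed values, with weights given by how similar each past time point is to t. With a similarity kernel K (illustrative symbols):
    f(t) = sum_i [ K(t, t_i) / sum_j K(t, t_j) ] * y_i
Each past time t_i acts like a key, its observation y_i like a value, and the current time t like the query.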
Explain the basic idea of self-attention for text using word embeddings.
attention self-attention transformers
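(Sketch, not from source) A minimal NumPy illustration of the idea, assuming the simplest case where the word embeddings themselves act as queries, keys, and values; the toy embedding values are invented:
    import numpy as np

    def softmax(scores, axis=-1):
        scores = scores - scores.max(axis=axis, keepdims=True)
        exp = np.exp(scores)
        return exp / exp.sum(axis=axis, keepdims=True)

    # One row per token: toy 2-d word embeddings (made-up numbers).
    X = np.array([[1.0, 0.0],   # "the"
                  [0.9, 0.1],   # "bank"
                  [0.1, 1.0]])  # "river"

    weights = softmax(X @ X.T)  # token-to-token similarity -> attention weights
    contextual = weights @ X    # each row mixes in the embeddings it attends to
Each output row is a context-dependent embedding: a weighted average of all token embeddings, weighted by similarity.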
Explain what queries, keys, and values are in self-attention using a diagram.
attention self-attention transformers
(See source material.)
Source: Keys, Values, and Queries > 9:00
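(Sketch, not from source) In symbols, assuming standard notation: for input token vectors x_1, ..., x_n, each head learns three projection matrices, and
    q_i = W_q x_i,   k_i = W_k x_i,   v_i = W_v x_i
    out_i = sum_j softmax_j(q_i^T k_j / sqrt(d_k)) v_j
so the query expresses what token i is looking for, the keys express what each token offers, and the values are what gets passed along once matched.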
Write out a schematic diagram of a self-attention block.
attention self-attention transformers
(See source material.)
Source: Keys, Values, and Queries > 11:00
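(Sketch, not from source) A schematic in the same style as the encoder-block pseudocode further down, assuming single-head scaled dot-product self-attention with learned matrices W_q, W_k, W_v:
    def SelfAttention(X):                     # X: (n, d_model) token embeddings
        Q, K, V = X @ W_q, X @ W_k, X @ W_v   # learned linear projections
        A = softmax(Q @ K.T / sqrt(d_k))      # (n, n) attention weights, row-wise softmax
        return A @ V                          # each output row is a weighted mix of values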
(From source) What parts of the self-attention block are trainable when we have keys, queries, and values?
attention self-attention transformers
(Brian) The learned projection matrices used to compute keys, queries, and values.
Source: Keys, Values, and Queries
What are 2 approaches for positional encodings (high-level ideas)?
attention self-attention transformers positional-embeddings
(See source material.)
Source: Transformers
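(Sketch, not from source) The two approaches usually contrasted are (1) fixed sinusoidal encodings added to the token embeddings and (2) learned position embeddings looked up by index; the source may frame them differently. The sinusoidal form, for position t and dimension pair (2i, 2i+1):
    p(t, 2i)   = sin(t / 10000^(2i/d))
    p(t, 2i+1) = cos(t / 10000^(2i/d))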
(From source) What is the purpose of the positional encoding in the transformer architecture?
attention self-attention transformers positional-embeddings
(Brian) To inject information about token order, since self-attention on its own is permutation-invariant and would otherwise ignore position.
Source: Transformers
(From source) Why are transformers easier to parallelize than recurrent neural networks?
attention self-attention transformers
(Brian) Because self-attention has no sequential dependence between time steps: all tokens (and all attention heads) can be computed in parallel, whereas an RNN must process the sequence one step at a time.
Source: Transformers
What is the equation for attention computed over mini-batches?
attention transformers
(See source material.)
Source: Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. 15.44
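(Sketch, not from source) The standard matrix form, assuming Q is (n, d_k), K is (m, d_k), V is (m, d_v), and the softmax is applied row-wise; see the cited equation for Murphy's exact statement:
    Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V    # output is (n, d_v)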
What is the basic equation for scaled dot product attention? Why is it scaled?
attention transformers
(See source material.) To ensure the variance of the inner product remains 1 regardless of the size of the inputs, it is standard to divide by sqrt(d).
Source: Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. 15.43
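(Sketch, not from source) The standard form for a single query q against keys k_1..m and values v_1..m, assuming Murphy's notation:
    Attn(q, (k_1, v_1), ..., (k_m, v_m)) = sum_i softmax_i(q^T k_i / sqrt(d_k)) v_i
If the entries of q and k are independent with zero mean and unit variance, q^T k has variance d_k, so dividing by sqrt(d_k) keeps the score variance at 1 and keeps the softmax out of its saturated region.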
What is the basic equation for attention?
attention transformers
(See source material.)
Source: Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. 15.34
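(Sketch, not from source) The general form, assuming an arbitrary attention-score function a(q, k):
    Attn(q, (k_1, v_1), ..., (k_m, v_m)) = sum_i alpha_i(q, k_1:m) v_i,   where alpha_i = softmax_i([a(q, k_1), ..., a(q, k_m)])
That is, the output is a weighted sum of the values, with weights from a softmax over the query-key scores.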
Pseudocode for transformer encoder block.
attention self-attention transformers
def EncoderBlock(X):
    Z = LayerNorm(MultiHeadAttn(Q=X, K=X, V=X) + X)
    E = LayerNorm(FeedForward(Z) + Z)
    return E
Source: Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. 15.5.4
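(Sketch, not from source) A runnable PyTorch version of the pseudocode above; the sizes d_model=64, num_heads=4, d_ff=256 are arbitrary choices, not from the source:
    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        def __init__(self, d_model=64, num_heads=4, d_ff=256):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, X):                  # X: (batch, seq_len, d_model)
            attn_out, _ = self.attn(X, X, X)   # self-attention: queries, keys, values all come from X
            Z = self.norm1(attn_out + X)       # residual connection + layer norm
            return self.norm2(self.ff(Z) + Z)  # position-wise feed-forward, residual + layer norm

    E = EncoderBlock()(torch.randn(2, 10, 64))  # (batch=2, seq_len=10, d_model=64)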
Draw a diagram comparing 1D CNNs, RNNs, and self-attention.
CNNs RNNs transformers
(See source material.)
Source: Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. Figure 15.27