ML | Transformers | Basics | Priority Flashcards

1
Q

Explain the intuition of attention using a time series.

attention self-attention transformers

A

(See source material.)

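(Not from the source; a minimal sketch of one common intuition, viewing attention as kernel smoothing over a time series.) To estimate a value at a query time, take a weighted average of the observed values, weighting each observation by how close its time is to the query time. The weights play the role of attention weights:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Observed time series: values y measured at times t.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 1.5, 3.0, 2.5])

t_query = 2.4                         # time we want an estimate for
scores = -(t - t_query) ** 2          # similarity: closer times get higher scores
weights = softmax(scores)             # attention weights, nonnegative and summing to 1
estimate = weights @ y                # attention output: weighted average of the values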

2
Q

Explain the basic idea of self-attention for text using word embeddings.

attention self-attention transformers

A

(See source material.)


Source: Self-attention > 14:00
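(Not from the source; a minimal sketch of the basic idea, using the raw word embeddings as queries, keys, and values.) Score every pair of words by the dot product of their embeddings, turn each row of scores into weights with a softmax, and replace each word's embedding with the weighted average of all the embeddings:

import numpy as np

def row_softmax(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def basic_self_attention(X):
    # X: (num_tokens, embed_dim) matrix of word embeddings.
    scores = X @ X.T               # pairwise dot-product similarities
    weights = row_softmax(scores)  # each row is a distribution over the tokens
    return weights @ X             # each output row is a mixture of all embeddings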

3
Q

Explain what queries, keys, and values are in self-attention, using a diagram.

attention self-attention transformers

A

(See source material.)

4
Q

Write out a schematic diagram of a self-attention block.

attention self-attention transformers

A

(See source material.)

5
Q

(From source) What parts of the self-attention block are trainable when we have keys, queries, and values?

attention self-attention transformers

A

(Brian) The projection matrices that map the inputs to the queries, keys, and values (W_Q, W_K, W_V).

6
Q

What are 2 approaches for positional encodings (high-level ideas)?

attention self-attention transformers positional-embeddings

A

(See source material.)

Source: Transformers
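(Not from the source; for reference.) Two standard approaches are (1) fixed sinusoidal encodings added to the token embeddings, and (2) learned positional embeddings, i.e. a trainable vector per position. The sinusoidal form (Vaswani et al., 2017), for position pos, dimension index i, and embedding width d:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)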

7
Q

(From source) What is the purpose of the positional encoding in the transformer architecture?

attention self-attention transformers positional-embeddings

A

(Brian) To inject information about token positions, since self-attention on its own is permutation-invariant and ignores word order.

Source: Transformers

8
Q

(From source) Why are transformers easier to parallelize than recurrent neural networks?

attention self-attention transformers

A

(Brian) Because self-attention has no sequential dependence between positions: all positions (and all attention heads) can be computed at once, whereas an RNN must process tokens one after another.

Source: Transformers

9
Q

What is the equation for attention computed over mini-batches?

attention transformers

A

(See source material.)

Source: Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. 15.44
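(For reference; a standard form, not guaranteed to match the source's exact notation.) For a minibatch of n queries stacked as rows of Q (n x d), with m keys K (m x d) and values V (m x d_v):

\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V

where the softmax is applied row-wise, giving an n x d_v output.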

10
Q

What is the basic equation for scaled dot product attention? Why is it scaled?

attention transformers

A

(See source material.) To ensure the variance of the inner product remains 1 regardless of the size of the inputs, it is standard to divide by sqrt(d).

Source: Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. 15.43
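(For reference; a standard form, not guaranteed to match the source's exact notation.) For a single query q and key-value pairs (k_1, v_1), ..., (k_m, v_m) with keys of dimension d:

\text{Attn}(q, (k_{1:m}, v_{1:m})) = \sum_{i=1}^{m} \text{softmax}_i\left(\left[\frac{q^\top k_1}{\sqrt{d}}, \ldots, \frac{q^\top k_m}{\sqrt{d}}\right]\right) v_i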

11
Q

What is the basic equation for attention?

attention transformers

A

(See source material.)

Source: Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. 15.34
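(For reference; a standard general form, not guaranteed to match the source's exact notation.) Given a query q and key-value pairs (k_1, v_1), ..., (k_m, v_m), attention is a weighted sum of the values, with weights given by a softmax over attention scores a(q, k_i):

\text{Attn}(q, (k_{1:m}, v_{1:m})) = \sum_{i=1}^{m} \alpha_i(q, k_{1:m}) \, v_i, \qquad \alpha_i(q, k_{1:m}) = \text{softmax}_i\left([a(q, k_1), \ldots, a(q, k_m)]\right)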

12
Q

Write out pseudocode for a transformer encoder block.

attention self-attention transformers

A
def EncoderBlock(X):
    # Self-attention (queries, keys, and values all come from X),
    # followed by a residual connection and layer normalization.
    Z = LayerNorm(MultiHeadAttn(Q=X, K=X, V=X) + X)
    # Position-wise feedforward layer, again with residual + layer norm.
    E = LayerNorm(FeedForward(Z) + Z)
    return E

Source: Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. 15.5.4
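A runnable PyTorch sketch of the same post-norm block (my own, not from Murphy); d_model, num_heads, and d_ff are illustrative hyperparameters:

import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                   # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)    # self-attention: queries = keys = values = x
        z = self.norm1(attn_out + x)        # residual connection, then layer norm
        return self.norm2(self.ffn(z) + z)  # position-wise FFN, residual, layer norm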

13
Q

Draw a diagram comparing 1D CNNs, RNNs, and self-attention.

CNNs RNNs transformers

A

(See source material.)

Source: Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. Figure 15.27
