ML | Transformers | Basics | Priority Flashcards

1
Q

Explain the intuition of attention using a time series.

attention self-attention transformers

A

(See source material.)

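(Not from the source; a minimal sketch of one common intuition, viewing attention as kernel smoothing over a time series.) To estimate a value at a query time, take a weighted average of the observed values, weighting each observation by how close its time is to the query time. The weights play the role of attention weights:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Observed time series: values y measured at times t.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 1.5, 3.0, 2.5])

t_query = 2.4                         # time we want an estimate for
scores = -(t - t_query) ** 2          # similarity: closer times get higher scores
weights = softmax(scores)             # attention weights, nonnegative and summing to 1
estimate = weights @ y                # attention output: weighted average of the values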

2
Q

Explain the basic idea of self-attention for text using word embeddings.

attention self-attention transformers

A

(See source material.)


Source: Self-attention > 14:00
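(Not from the source; a minimal sketch of the basic idea, using the raw word embeddings as queries, keys, and values.) Score every pair of words by the dot product of their embeddings, turn each row of scores into weights with a softmax, and replace each word's embedding with the weighted average of all the embeddings:

import numpy as np

def row_softmax(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def basic_self_attention(X):
    # X: (num_tokens, embed_dim) matrix of word embeddings.
    scores = X @ X.T               # pairwise dot-product similarities
    weights = row_softmax(scores)  # each row is a distribution over the tokens
    return weights @ X             # each output row is a mixture of all embeddings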

3
Q

Explain what queries, keys, and values are in self-attention, using a diagram.

attention self-attention transformers

A

(See source material.)

4
Q

Write out a schematic diagram of a self-attention block.

attention self-attention transformers

A

(See source material.)

5
Q

(From source) What parts of the self-attention block are trainable when we have keys, queries, and values?

attention self-attention transformers

A

(Brian) The projection matrices that map the inputs to the queries, keys, and values (W_Q, W_K, W_V).

6
Q

What are 2 approaches for positional encodings (high-level ideas)?

attention self-attention transformers positional-embeddings

A

(See source material.)

Source: Transformers
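(Not from the source; for reference.) Two standard approaches are (1) fixed sinusoidal encodings added to the token embeddings, and (2) learned positional embeddings, i.e. a trainable vector per position. The sinusoidal form (Vaswani et al., 2017), for position pos, dimension index i, and embedding width d:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)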

7
Q

(From source) What is the purpose of the positional encoding in the transformer architecture?

attention self-attention transformers positional-embeddings

A

(Brian) To inject information about token positions, since self-attention on its own is permutation-invariant and ignores word order.

Source: Transformers

8
Q

(From source) Why are transformers easier to parallelize than recurrent neural networks?

attention self-attention transformers

A

(Brian) Because self-attention has no sequential dependence between positions: all positions (and all attention heads) can be computed at once, whereas an RNN must process tokens one after another.

Source: Transformers

9
Q

What is the equation for attention computed over mini-batches?

attention transformers

A

(See source material.)

Source: Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. 15.44
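(For reference; a standard form, not guaranteed to match the source's exact notation.) For a minibatch of n queries stacked as rows of Q (n x d), with m keys K (m x d) and values V (m x d_v):

\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V

where the softmax is applied row-wise, giving an n x d_v output.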

10
Q

What is the basic equation for scaled dot product attention? Why is it scaled?

attention transformers

A

(See source material.) To ensure the variance of the inner product remains 1 regardless of the size of the inputs, it is standard to divide by sqrt(d).

Source: Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. 15.43
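(For reference; a standard form, not guaranteed to match the source's exact notation.) For a single query q and key-value pairs (k_1, v_1), ..., (k_m, v_m) with keys of dimension d:

\text{Attn}(q, (k_{1:m}, v_{1:m})) = \sum_{i=1}^{m} \text{softmax}_i\left(\left[\frac{q^\top k_1}{\sqrt{d}}, \ldots, \frac{q^\top k_m}{\sqrt{d}}\right]\right) v_i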

11
Q

What is the basic equation for attention?

attention transformers

A

(See source material.)

Source: Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. 15.34
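(For reference; a standard general form, not guaranteed to match the source's exact notation.) Given a query q and key-value pairs (k_1, v_1), ..., (k_m, v_m), attention is a weighted sum of the values, with weights given by a softmax over attention scores a(q, k_i):

\text{Attn}(q, (k_{1:m}, v_{1:m})) = \sum_{i=1}^{m} \alpha_i(q, k_{1:m}) \, v_i, \qquad \alpha_i(q, k_{1:m}) = \text{softmax}_i\left([a(q, k_1), \ldots, a(q, k_m)]\right)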

12
Q

Write out pseudocode for a transformer encoder block.

attention self-attention transformers

A
def EncoderBlock(X):
    # Self-attention (queries, keys, and values all come from X),
    # followed by a residual connection and layer normalization.
    Z = LayerNorm(MultiHeadAttn(Q=X, K=X, V=X) + X)
    # Position-wise feedforward layer, again with residual + layer norm.
    E = LayerNorm(FeedForward(Z) + Z)
    return E

Source: Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. 15.5.4
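A runnable PyTorch sketch of the same post-norm block (my own, not from Murphy); d_model, num_heads, and d_ff are illustrative hyperparameters:

import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                   # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)    # self-attention: queries = keys = values = x
        z = self.norm1(attn_out + x)        # residual connection, then layer norm
        return self.norm2(self.ffn(z) + z)  # position-wise FFN, residual, layer norm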

13
Q

Draw a diagram comparing 1D CNNs, RNNs, and self-attention.

CNNs RNNs transformers

A

(See source material.)

Source: Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. Figure 15.27
