Transformer Deck Flashcards

AI Knowledge

1

Q: What is a layer?

A: A processing stage in a neural network that transforms its input. In the Transformer, multiple identical layers are stacked on top of each other (6 in the base model, in both the encoder and the decoder).

2

Q: What is a sub-layer?

A: A component within a layer that performs a specific function. In Transformers, each layer has:

Self-attention sub-layer
Feed-forward network sub-layer
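
A minimal PyTorch sketch of one encoder layer with these two sub-layers (the class name, default sizes, and the use of nn.MultiheadAttention are illustrative assumptions, not the paper's own code):

import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        # Sub-layer 1: multi-head self-attention
        self.self_attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        # Sub-layer 2: position-wise feed-forward network
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Each sub-layer is wrapped in a residual connection and layer normalization
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))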

3

Q: What is an embedding?

A: A learned vector representation of input tokens (words/subwords) that converts them into numerical vectors the model can process. In Transformers, embeddings have dimension d_model (512 in base model).
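
A minimal PyTorch sketch (the vocabulary size and token ids below are placeholders; the sqrt(d_model) scaling of embeddings is from the paper):

import math
import torch
import torch.nn as nn

vocab_size, d_model = 37000, 512
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[5, 42, 7]])             # a batch of token ids, shape (1, 3)
x = embedding(token_ids) * math.sqrt(d_model)      # embedded vectors, shape (1, 3, 512)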

4

Q: What are Queries (Q), Keys (K), and Values (V)?

A: The three components used in the attention mechanism:

Queries: What we’re searching for
Keys: What we’re matching against
Values: What we actually retrieve

Think of it like looking up a word in a dictionary, where:
Query = the word you’re looking up
Key = the dictionary entries
Value = the definitions
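
In the Transformer, Q, K, and V are produced by learned linear projections of the token representations. A minimal PyTorch sketch (shapes and variable names are illustrative):

import torch
import torch.nn as nn

d_model = 512
x = torch.randn(1, 10, d_model)     # (batch, sequence length, d_model)

W_q = nn.Linear(d_model, d_model)   # projection that produces queries
W_k = nn.Linear(d_model, d_model)   # projection that produces keys
W_v = nn.Linear(d_model, d_model)   # projection that produces values

Q, K, V = W_q(x), W_k(x), W_v(x)    # in self-attention, all three come from the same input x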

5

Q: What is dot-product attention?

A: A method to compute attention by:

Taking the dot product of query and key vectors (scaled by √d_k in the Transformer’s scaled dot-product attention)
Applying softmax to get attention weights
Using the weights to compute a weighted sum of the values
Formula: Attention(Q, K, V) = softmax(QK^T / √d_k) V
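
The steps above as a minimal sketch (PyTorch; the function name is illustrative):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # 1) dot products, scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)                  # 2) softmax over the keys -> attention weights
    return weights @ V                                   # 3) weighted sum of the values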

6

Q: What is masked attention?

A: A modification of attention used in the decoder that prevents positions from attending to subsequent positions. This ensures the model can’t see future tokens during training/inference.
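
A minimal sketch of the causal mask (PyTorch; variable names are illustrative):

import torch

seq_len = 5
# mask[i, j] is True wherever j > i, i.e. wherever a position would "see the future"
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
# Those positions are set to -inf before the softmax, so their attention weights become 0:
# scores = scores.masked_fill(mask, float('-inf'))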

7

Q: What is d_model?

A: The dimensionality of the model’s internal representations (512 in base model). This is:

Size of embeddings
Size of each layer’s output
Size of attention mechanism’s output

8

Q: What is d_ff?

A: The dimensionality of the feed-forward network’s inner layer (2048 in base model). This is larger than d_model to allow for more complex transformations.
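
A minimal PyTorch sketch of the position-wise feed-forward sub-layer (the ReLU between the two linear layers is from the paper):

import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # expand 512 -> 2048
    nn.ReLU(),
    nn.Linear(d_ff, d_model),   # project back 2048 -> 512
)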

9

Q: What is h (number of heads)?

A: The number of parallel attention heads (8 in the base model). Each head can focus on different aspects of the input, and each head operates on a reduced dimension d_k = d_model / h (64 in the base model).
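
A minimal sketch of how the d_model dimensions are split across h heads (PyTorch; shapes are illustrative):

import torch

batch, seq_len, d_model, h = 1, 10, 512, 8
d_k = d_model // h                                        # 64 in the base model

x = torch.randn(batch, seq_len, d_model)
heads = x.view(batch, seq_len, h, d_k).transpose(1, 2)    # (batch, h, seq_len, d_k)
# Attention is computed independently in each head; the h outputs are then
# concatenated back to d_model and passed through a final linear projection.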

10

Q: What is label smoothing?

A: A regularization technique that prevents the model from becoming over-confident by:

Softening the hard targets (1s and 0s)
Reserving some probability for incorrect classes
Value used: ε_ls = 0.1
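
A minimal sketch of one common label-smoothing variant (plain Python; spreading ε over the other classes is an assumption about the exact variant used):

def smoothed_targets(num_classes, target, eps=0.1):
    # The correct class gets 1 - eps; the remaining eps is spread over the other classes.
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if i == target else off_value for i in range(num_classes)]

print(smoothed_targets(num_classes=4, target=2))   # [0.0333..., 0.0333..., 0.9, 0.0333...]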

11

Q: What are warmup steps?

A: The initial training steps during which the learning rate increases linearly before decaying. Used to:

Stabilize early training
Prevent early divergence
Value used: 4000 steps
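
The learning-rate schedule from the paper as a small function: the rate grows linearly for the first warmup_steps steps, then decays with the inverse square root of the step number.

def lrate(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(lrate(100), lrate(4000), lrate(100000))   # rising, peak, decayed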

12

Q: What is beam search?

A: A decoding strategy that:

Keeps track of multiple possible output sequences
Chooses the most probable sequence
Uses beam size (number of sequences to track) and length penalty
Values used: beam size = 4, α = 0.6
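
A minimal sketch of how candidate sequences are typically scored with a length penalty during beam search; the exact penalty below is the GNMT-style formula, which is an assumption about the precise form used:

def length_penalty(length, alpha=0.6):
    return ((5 + length) / 6) ** alpha

def beam_score(sum_log_prob, length, alpha=0.6):
    # Higher is better; dividing by the penalty keeps longer sequences competitive,
    # since their summed log-probabilities are naturally more negative.
    return sum_log_prob / length_penalty(length, alpha)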

13

Q: What is BLEU?

A: Bilingual Evaluation Understudy Score:

Metric for evaluating translation quality
Compares model output to human references
Higher scores indicate better translations
Ranges from 0 to 100
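
A minimal sketch of computing corpus BLEU, assuming the sacrebleu package is installed (the example sentences are made up):

import sacrebleu

hypotheses = ["the cat sat on the mat"]              # model outputs
references = [["the cat is sitting on the mat"]]     # one list of references per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)                                    # a value between 0 and 100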

14

Q: What are FLOPs?

A: Floating Point Operations:

Measure of computational work
Used to compare training efficiency
Lower is better for same performance level
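
A rough sketch of one way total training FLOPs can be estimated (all numbers below are hypothetical placeholders):

training_time_s = 12 * 3600        # e.g. 12 hours of wall-clock training (hypothetical)
num_gpus = 8                       # hypothetical
sustained_flops_per_gpu = 3.0e12   # sustained throughput per GPU in FLOP/s (hypothetical)

total_flops = training_time_s * num_gpus * sustained_flops_per_gpu
print(f"{total_flops:.2e} FLOPs")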

15

Q: What is the “base model” vs “big model”?

A: Two variants of Transformer:
Base model:

d_model = 512
h = 8 heads
6 layers
65M parameters

Big model:

d_model = 1024
h = 16 heads
6 layers
213M parameters
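
The same two configurations written as a simple lookup (values copied from this card):

TRANSFORMER_CONFIGS = {
    "base": {"d_model": 512,  "heads": 8,  "layers": 6, "parameters": "65M"},
    "big":  {"d_model": 1024, "heads": 16, "layers": 6, "parameters": "213M"},
}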

16

Q: What is byte-pair encoding (BPE)?

A: A vocabulary creation method that:

Breaks words into subword units
Balances vocabulary size and coverage
Helps handle rare words
Used in the paper: a shared source-target vocabulary of ~37,000 tokens for English-German
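
A toy sketch of one BPE training step, roughly following the classic algorithm: count adjacent symbol pairs over the corpus and merge the most frequent pair (the corpus and symbols here are made up; a real implementation repeats this until the target vocabulary size is reached):

from collections import Counter

def most_frequent_pair(words):
    # words: space-separated symbol sequences mapped to their corpus frequencies
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Naive merge: join the chosen pair into a single new symbol everywhere it occurs.
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in words.items()}

words = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}
pair = most_frequent_pair(words)    # ('w', 'e') in this toy corpus
words = merge_pair(words, pair)     # e.g. "n e we s t </w>"; repeat for more merges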

17

Q: What is layer normalization?

A: A technique to normalize activations:

Applied after each sub-layer
Helps with training stability
Operates across feature dimension
Formula: LayerNorm(x + Sublayer(x))
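
A minimal PyTorch sketch of this residual-plus-normalization step (variable names are illustrative; the random tensors just stand in for real activations):

import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)            # normalizes across the feature (d_model) dimension

x = torch.randn(2, 10, d_model)         # (batch, sequence length, d_model)
sublayer_out = torch.randn_like(x)      # stand-in for Sublayer(x), e.g. a self-attention output
y = norm(x + sublayer_out)              # LayerNorm(x + Sublayer(x))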