Transformer Deck Flashcards
AI Knowledge
Q: What is a layer?
A: A processing stage in a neural network that transforms its input. In the Transformer, the encoder and the decoder each stack 6 identical layers (in both the base and big models).
Q: What is a sub-layer?
A: A component within a layer that performs a specific function. In the Transformer, each encoder layer has two sub-layers:
Self-attention sub-layer
Feed-forward network sub-layer
Each decoder layer adds a third sub-layer: encoder-decoder attention over the encoder's output.
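A minimal NumPy sketch of how the two sub-layers sit inside one encoder layer, each wrapped in a residual connection and layer normalization (the `self_attention` and `feed_forward` arguments are hypothetical callables standing in for the real sub-layers):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's vector to zero mean / unit variance.
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, feed_forward):
    # Sub-layer 1: self-attention, then Add & Norm (residual connection).
    x = layer_norm(x + self_attention(x))
    # Sub-layer 2: position-wise feed-forward network, then Add & Norm.
    x = layer_norm(x + feed_forward(x))
    return x
```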
Q: What is an embedding?
A: A learned vector representation of input tokens (words/subwords) that converts them into numerical vectors the model can process. In Transformers, embeddings have dimension d_model (512 in base model).
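A minimal sketch of an embedding as a learned lookup table of shape (vocab_size, d_model); the vocabulary size and token ids below are illustrative, and the table is random rather than trained:

```python
import numpy as np

vocab_size, d_model = 37000, 512              # vocab size is an assumption here
embedding_table = np.random.randn(vocab_size, d_model) * 0.01

token_ids = np.array([5, 42, 7])              # hypothetical token ids
vectors = embedding_table[token_ids]          # shape (3, 512): one vector per token
```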
Q: What are Queries (Q), Keys (K), and Values (V)?
A: The three components used in the attention mechanism:
Queries: What we’re searching for
Keys: What we’re matching against
Values: What we actually retrieve
Think of it like looking up a word in a dictionary where:
Query = word you’re looking up
Key = dictionary entries
Value = definitions
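A minimal sketch of how Q, K, and V are produced: linear projections of the same input vectors (the weight matrices here are random placeholders, not trained parameters):

```python
import numpy as np

d_model = 512
x = np.random.randn(10, d_model)              # 10 token vectors
W_q = np.random.randn(d_model, d_model) * 0.01
W_k = np.random.randn(d_model, d_model) * 0.01
W_v = np.random.randn(d_model, d_model) * 0.01

Q, K, V = x @ W_q, x @ W_k, x @ W_v           # each of shape (10, d_model)
```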
Q: What is dot-product attention?
A: A method to compute attention by:
Taking the dot product of query and key vectors (the Transformer scales these by 1/√d_k, giving "scaled dot-product attention")
Applying softmax to get attention weights
Using the weights to compute a weighted sum of the values
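A minimal NumPy sketch of scaled dot-product attention following the steps above:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # dot products, scaled by sqrt(d_k)
    weights = softmax(scores)                 # attention weights sum to 1 per query
    return weights @ V                        # weighted sum of the values
```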
Q: What is masked attention?
A: A modification of attention used in the decoder that prevents positions from attending to subsequent positions. This ensures the model can’t see future tokens during training/inference.
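A minimal sketch of the causal mask: future positions get a score of -inf before the softmax, so they receive zero attention weight:

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal = positions that must not be attended to.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_scores(scores):
    # Apply before softmax so future positions get zero weight.
    scores = scores.copy()
    scores[causal_mask(scores.shape[-1])] = -np.inf
    return scores
```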
Q: What is d_model?
A: The dimensionality of the model’s internal representations (512 in base model). This is:
Size of embeddings
Size of each layer’s output
Size of attention mechanism’s output
Q: What is d_ff?
A: The dimensionality of the feed-forward network’s inner layer (2048 in base model). This is larger than d_model to allow for more complex transformations.
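A minimal sketch of the position-wise feed-forward sub-layer with the base-model sizes (expand to d_ff = 2048, apply ReLU, project back to d_model = 512); the weights are random placeholders:

```python
import numpy as np

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)

def feed_forward(x):
    # ReLU(x W1 + b1) W2 + b2, applied identically at every position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2
```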
Q: What is h (number of heads)?
A: The number of parallel attention mechanisms (8 in base model). Each head can focus on different aspects of the input.
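A minimal sketch of the dimension bookkeeping: with h = 8 and d_model = 512, each head works in a subspace of size d_k = 64, and concatenating the heads restores d_model (the separate per-head projection matrices are omitted for brevity):

```python
import numpy as np

h, d_model = 8, 512
d_k = d_model // h                      # 64 dimensions per head

x = np.random.randn(10, d_model)        # 10 tokens
heads = x.reshape(10, h, d_k)           # split 512 dims into 8 heads of 64
merged = heads.reshape(10, d_model)     # concatenating the heads restores d_model
```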
Q: What is label smoothing?
A: A regularization technique that prevents the model from becoming over-confident by:
Softening the hard targets (1s and 0s)
Reserving some probability for incorrect classes
Value used: ε_ls = 0.1
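A minimal sketch of one common label-smoothing variant: the target class keeps probability 1 − ε and the remaining ε is spread over the other classes (exact distributions differ slightly between implementations):

```python
import numpy as np

def smooth_labels(target_id, num_classes, eps=0.1):
    # Spread eps evenly over the incorrect classes, keep 1 - eps on the target.
    dist = np.full(num_classes, eps / (num_classes - 1))
    dist[target_id] = 1.0 - eps
    return dist

print(smooth_labels(target_id=2, num_classes=5))
# -> [0.025 0.025 0.9 0.025 0.025]
```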
Q: What are warmup steps?
A: The initial training steps during which the learning rate increases linearly before decaying. Used to:
Stabilize early training
Prevent early divergence
Value used: 4000 steps
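The paper's schedule in code: the learning rate grows linearly for the first `warmup` steps, then decays proportionally to the inverse square root of the step number:

```python
def learning_rate(step, d_model=512, warmup=4000):
    # step must be >= 1; at step == warmup the two terms meet.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```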
Q: What is beam search?
A: A decoding strategy that:
Keeps track of multiple possible output sequences
Chooses the most probable sequence
Uses beam size (number of sequences to track) and length penalty
Values used: beam size = 4, α = 0.6
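A minimal sketch of length-penalized re-scoring during beam search, assuming the GNMT-style penalty commonly paired with α = 0.6 (the exact penalty formula is an assumption here, not stated in the card):

```python
def length_penalty(length, alpha=0.6):
    # GNMT-style penalty: ((5 + |Y|) / 6) ** alpha (assumed formula).
    return ((5 + length) / 6) ** alpha

def sequence_score(log_prob_sum, length, alpha=0.6):
    # Divide the summed log-probabilities by the penalty so longer
    # sequences are not unfairly penalized for having more tokens.
    return log_prob_sum / length_penalty(length, alpha)
```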
Q: What is BLEU?
A: Bilingual Evaluation Understudy Score:
Metric for evaluating translation quality
Compares model output to human references
Higher scores indicate better translations
Ranges from 0 to 100
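A minimal usage sketch with the sacrebleu library, one common BLEU implementation (not necessarily the exact scoring script used in the original paper):

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]                 # model outputs
references = [["the cat is sitting on the mat"]]        # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)                                       # score on the 0-100 scale
```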
Q: What are FLOPs?
A: Floating Point Operations:
Measure of computational work
Used to compare training efficiency
Lower is better for the same performance level
Q: What is the “base model” vs “big model”?
A: Two variants of Transformer:
Base model:
d_model = 512
h = 8 heads
6 layers
65M parameters
Big model:
d_model = 1024
h = 16 heads
6 layers
213M parameters
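A back-of-envelope parameter count under simplifying assumptions (only attention and feed-forward weight matrices plus one shared embedding table; d_ff = 4096 for the big model and a ~37k vocabulary are assumed; biases and layer norms are ignored), which lands close to the reported totals:

```python
def approx_params(d_model, d_ff, layers=6, vocab=37000):
    enc_layer = 4 * d_model**2 + 2 * d_model * d_ff     # self-attention + FFN
    dec_layer = 8 * d_model**2 + 2 * d_model * d_ff     # adds cross-attention
    embeddings = vocab * d_model                        # shared embedding table
    return layers * (enc_layer + dec_layer) + embeddings

print(approx_params(512, 2048))    # ~63M  (paper reports 65M for base)
print(approx_params(1024, 4096))   # ~214M (paper reports 213M for big)
```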