Transformer Deck Flashcards
AI Knowledge
Q: What is a layer?
A: A processing stage in a neural network that transforms its input. In the Transformer, the encoder and the decoder each stack 6 identical layers (in both the base and big models).
Q: What is a sub-layer?
A: A component within a layer that performs a specific function. In the Transformer, each encoder layer has two sub-layers:
Self-attention sub-layer
Feed-forward network sub-layer
Each decoder layer adds a third sub-layer: encoder-decoder attention over the encoder's output.
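A minimal NumPy sketch of how the two sub-layers sit inside one encoder layer, each wrapped in a residual connection and layer normalization (the `self_attention` and `feed_forward` arguments are hypothetical callables standing in for the real sub-layers):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's vector to zero mean / unit variance.
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, feed_forward):
    # Sub-layer 1: self-attention, then Add & Norm (residual connection).
    x = layer_norm(x + self_attention(x))
    # Sub-layer 2: position-wise feed-forward network, then Add & Norm.
    x = layer_norm(x + feed_forward(x))
    return x
```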
Q: What is an embedding?
A: A learned vector representation of input tokens (words/subwords) that converts them into numerical vectors the model can process. In Transformers, embeddings have dimension d_model (512 in base model).
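A minimal sketch of an embedding as a learned lookup table of shape (vocab_size, d_model); the vocabulary size and token ids below are illustrative, and the table is random rather than trained:

```python
import numpy as np

vocab_size, d_model = 37000, 512              # vocab size is an assumption here
embedding_table = np.random.randn(vocab_size, d_model) * 0.01

token_ids = np.array([5, 42, 7])              # hypothetical token ids
vectors = embedding_table[token_ids]          # shape (3, 512): one vector per token
```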
Q: What are Queries (Q), Keys (K), and Values (V)?
A: The three components used in the attention mechanism:
Queries: What we’re searching for
Keys: What we’re matching against
Values: What we actually retrieve
Think of it like looking up a word in a dictionary where:
Query = word you’re looking up
Key = dictionary entries
Value = definitions
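A minimal sketch of how Q, K, and V are produced: linear projections of the same input vectors (the weight matrices here are random placeholders, not trained parameters):

```python
import numpy as np

d_model = 512
x = np.random.randn(10, d_model)              # 10 token vectors
W_q = np.random.randn(d_model, d_model) * 0.01
W_k = np.random.randn(d_model, d_model) * 0.01
W_v = np.random.randn(d_model, d_model) * 0.01

Q, K, V = x @ W_q, x @ W_k, x @ W_v           # each of shape (10, d_model)
```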
Q: What is dot-product attention?
A: A method to compute attention by:
Taking the dot product of query and key vectors (the Transformer scales these by 1/√d_k, giving "scaled dot-product attention")
Applying softmax to get attention weights
Using the weights to compute a weighted sum of the values
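A minimal NumPy sketch of scaled dot-product attention following the steps above:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # dot products, scaled by sqrt(d_k)
    weights = softmax(scores)                 # attention weights sum to 1 per query
    return weights @ V                        # weighted sum of the values
```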
Q: What is masked attention?
A: A modification of attention used in the decoder that prevents positions from attending to subsequent positions. This ensures the model can’t see future tokens during training/inference.
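A minimal sketch of the causal mask: future positions get a score of -inf before the softmax, so they receive zero attention weight:

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal = positions that must not be attended to.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_scores(scores):
    # Apply before softmax so future positions get zero weight.
    scores = scores.copy()
    scores[causal_mask(scores.shape[-1])] = -np.inf
    return scores
```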
Q: What is d_model?
A: The dimensionality of the model’s internal representations (512 in base model). This is:
Size of embeddings
Size of each layer’s output
Size of attention mechanism’s output
Q: What is d_ff?
A: The dimensionality of the feed-forward network’s inner layer (2048 in base model). This is larger than d_model to allow for more complex transformations.
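A minimal sketch of the position-wise feed-forward sub-layer with the base-model sizes (expand to d_ff = 2048, apply ReLU, project back to d_model = 512); the weights are random placeholders:

```python
import numpy as np

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)

def feed_forward(x):
    # ReLU(x W1 + b1) W2 + b2, applied identically at every position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2
```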
Q: What is h (number of heads)?
A: The number of parallel attention mechanisms (8 in base model). Each head can focus on different aspects of the input.
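A minimal sketch of the dimension bookkeeping: with h = 8 and d_model = 512, each head works in a subspace of size d_k = 64, and concatenating the heads restores d_model (the separate per-head projection matrices are omitted for brevity):

```python
import numpy as np

h, d_model = 8, 512
d_k = d_model // h                      # 64 dimensions per head

x = np.random.randn(10, d_model)        # 10 tokens
heads = x.reshape(10, h, d_k)           # split 512 dims into 8 heads of 64
merged = heads.reshape(10, d_model)     # concatenating the heads restores d_model
```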
Q: What is label smoothing?
A: A regularization technique that prevents the model from becoming over-confident by:
Softening the hard targets (1s and 0s)
Reserving some probability for incorrect classes
Value used: ε_ls = 0.1
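A minimal sketch of one common label-smoothing variant: the target class keeps probability 1 − ε and the remaining ε is spread over the other classes (exact distributions differ slightly between implementations):

```python
import numpy as np

def smooth_labels(target_id, num_classes, eps=0.1):
    # Spread eps evenly over the incorrect classes, keep 1 - eps on the target.
    dist = np.full(num_classes, eps / (num_classes - 1))
    dist[target_id] = 1.0 - eps
    return dist

print(smooth_labels(target_id=2, num_classes=5))
# -> [0.025 0.025 0.9 0.025 0.025]
```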
Q: What are warmup steps?
A: The initial training steps during which the learning rate increases linearly before decaying. Used to:
Stabilize early training
Prevent early divergence
Value used: 4000 steps
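The paper's schedule in code: the learning rate grows linearly for the first `warmup` steps, then decays proportionally to the inverse square root of the step number:

```python
def learning_rate(step, d_model=512, warmup=4000):
    # step must be >= 1; at step == warmup the two terms meet.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```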
Q: What is beam search?
A: A decoding strategy that:
Keeps track of multiple possible output sequences
Chooses the most probable sequence
Uses beam size (number of sequences to track) and length penalty
Values used: beam size = 4, α = 0.6
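A minimal sketch of length-penalized re-scoring during beam search, assuming the GNMT-style penalty commonly paired with α = 0.6 (the exact penalty formula is an assumption here, not stated in the card):

```python
def length_penalty(length, alpha=0.6):
    # GNMT-style penalty: ((5 + |Y|) / 6) ** alpha (assumed formula).
    return ((5 + length) / 6) ** alpha

def sequence_score(log_prob_sum, length, alpha=0.6):
    # Divide the summed log-probabilities by the penalty so longer
    # sequences are not unfairly penalized for having more tokens.
    return log_prob_sum / length_penalty(length, alpha)
```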
Q: What is BLEU?
A: Bilingual Evaluation Understudy Score:
Metric for evaluating translation quality
Compares model output to human references
Higher scores indicate better translations
Ranges from 0 to 100
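A minimal usage sketch with the sacrebleu library, one common BLEU implementation (not necessarily the exact scoring script used in the original paper):

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]                 # model outputs
references = [["the cat is sitting on the mat"]]        # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)                                       # score on the 0-100 scale
```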
Q: What are FLOPs?
A: Floating Point Operations:
Measure of computational work
Used to compare training efficiency
Lower is better for the same performance level
Q: What is the “base model” vs “big model”?
A: Two variants of Transformer:
Base model:
d_model = 512
h = 8 heads
6 layers
65M parameters
Big model:
d_model = 1024
h = 16 heads
6 layers
213M parameters
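A back-of-envelope parameter count under simplifying assumptions (only attention and feed-forward weight matrices plus one shared embedding table; d_ff = 4096 for the big model and a ~37k vocabulary are assumed; biases and layer norms are ignored), which lands close to the reported totals:

```python
def approx_params(d_model, d_ff, layers=6, vocab=37000):
    enc_layer = 4 * d_model**2 + 2 * d_model * d_ff     # self-attention + FFN
    dec_layer = 8 * d_model**2 + 2 * d_model * d_ff     # adds cross-attention
    embeddings = vocab * d_model                        # shared embedding table
    return layers * (enc_layer + dec_layer) + embeddings

print(approx_params(512, 2048))    # ~63M  (paper reports 65M for base)
print(approx_params(1024, 4096))   # ~214M (paper reports 213M for big)
```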