12 LLMs Flashcards
What are the three matrices produced from the input in self‑attention?
Query (Q), Key (K), and Value (V) matrices.
Write the scaled dot‑product attention formula.
Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
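A minimal NumPy sketch of the full computation (shapes and names are illustrative, not a specific library API):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) similarity matrix
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)              # row-wise softmax
    return w @ V                               # weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))           # toy inputs: seq_len=4, d_k=d_v=8
out = scaled_dot_product_attention(Q, K, V)    # shape (4, 8)
```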
What problem does positional encoding solve in Transformers?
It injects word‑order information that would otherwise be lost when processing all tokens in parallel.
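One common scheme is the fixed sinusoidal encoding from the original Transformer paper; a minimal sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Returns (seq_len, d_model) encodings added to the token embeddings.
    Assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1) token positions
    i = np.arange(d_model // 2)[None, :]       # dimension-pair index
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe
```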
Why do Transformers use multi‑head attention instead of a single head?
Multiple heads learn different relational patterns and increase representation capacity by operating in parallel sub‑spaces.
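A sketch of the head-splitting mechanics in plain NumPy (the projection matrices W* are placeholders a real model would learn):

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); each W*: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    def split(W):                              # project, then split into heads
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)  # (n_heads, seq_len, d_head)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)              # independent softmax per head
    heads = w @ V                              # each head attends in its own sub-space
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                         # mix the concatenated heads
```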
True/False: Each attention head chooses one ‘root’ word and attends only to it.
False — every head allows every token to attend to every other token with its own learned weights.
List the main components inside one encoder block.
(1) Multi‑head self‑attention, (2) add & layer‑norm, (3) position‑wise feed‑forward, (4) add & layer‑norm. (Token + positional embeddings are applied once, before the first block in the stack.)
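A minimal PyTorch sketch of one such block (hyper-parameters are illustrative defaults, not from any specific model):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)       # multi-head self-attention
        x = self.norm1(x + attn_out)           # add & layer-norm (1)
        x = self.norm2(x + self.ff(x))         # add & layer-norm (2)
        return x
```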
What two extra elements appear in a decoder block that are not in the encoder?
(1) Masked self‑attention (causal mask) and (2) cross‑attention that attends to encoder outputs.
When does the decoder’s masked self‑attention mask a position?
When the position is to the right (i.e., a future token) of the current token being generated.
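In code, the mask sets future positions' scores to −inf so they receive zero weight after the softmax; a NumPy sketch:

```python
import numpy as np

seq_len = 5
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # strictly above diagonal
scores[future] = -np.inf                   # block attention to future tokens
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)              # row i attends only to positions <= i
```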
Define auto‑regressive language modelling in one sentence.
Training a decoder‑only Transformer to predict the next token given all previous tokens in the sequence.
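The training loss is then the negative log-likelihood of each next token; a NumPy sketch (shapes are illustrative):

```python
import numpy as np

def next_token_nll(logits, token_ids):
    """logits: (seq_len, vocab_size); token_ids: (seq_len,) input tokens.
    Position t predicts token t+1, so targets are the inputs shifted left."""
    z = logits - logits.max(-1, keepdims=True)        # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(-1, keepdims=True))
    targets = token_ids[1:]                           # the last position has no target
    return -log_probs[np.arange(len(targets)), targets].mean()
```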
State one key limitation of relying only on auto‑regressive pre‑training.
It optimizes fluency, not alignment — the model may produce unsafe or unhelpful text despite grammatical correctness.
What is the goal of RLHF fine‑tuning?
To align an LLM with human preferences by using human‑rated outputs as a reward signal in reinforcement learning.
Name the two stages in RLHF.
(1) Train a reward model on ranked human preferences; (2) optimize the LLM with policy‑gradient (e.g., PPO) to maximize that reward.
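Stage 1's reward model is commonly trained with a pairwise Bradley–Terry loss over the ranked pairs; a minimal sketch:

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """r_chosen / r_rejected: reward-model scores for the human-preferred and
    rejected response in each ranked pair. Minimizing this pushes the preferred
    response's reward above the rejected one's."""
    margin = r_chosen - r_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin))).mean()  # -log sigmoid(margin)
```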
LoRA fine‑tuning freezes the base weights; what does it train instead?
Small low‑rank adapter matrices (ΔW = B·A with rank r ≪ d) added alongside selected frozen weight matrices, typically the attention projections; only A and B receive gradients.
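A one-function sketch of the LoRA forward pass (the alpha/r scaling follows the paper's convention; dimensions are illustrative):

```python
import numpy as np

def lora_forward(x, W_frozen, A, B, alpha=16, r=8):
    """x: (d_in,); W_frozen: (d_out, d_in), never updated.
    A: (r, d_in) and B: (d_out, r) are the only trained parameters, so the
    effective weight is W_frozen + (alpha / r) * B @ A, a rank-r update."""
    return W_frozen @ x + (alpha / r) * (B @ (A @ x))
```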
What two tricks does QLoRA combine?
(1) 4‑bit weight quantization of the frozen model, (2) LoRA low‑rank adapters for fine‑tuning.
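A loading recipe in the spirit of QLoRA, assuming the Hugging Face transformers + bitsandbytes API; the model id is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # trick 1: 4-bit quantized frozen base
    bnb_4bit_quant_type="nf4",               # NormalFloat4, as in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls still run in 16-bit
)
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-7b-model",                # placeholder model id
    quantization_config=bnb_config,
)
# Trick 2: attach LoRA adapters (e.g., with the peft library) and train only those.
```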
Which quantization method is CPU‑optimized for storage and loading?
GGUF (the llama.cpp format, designed for compact storage, fast loading, and CPU inference).
GPTQ’s key idea in one phrase.
Layer‑wise quantization that greedily rounds weights while compensating the error using second‑order (Hessian) information; GPU‑oriented.
What makes AWQ different from GPTQ?
It is activation‑aware: it uses activation statistics to find the small fraction of salient weight channels and protects them via per‑channel scaling instead of Hessian‑based compensation, keeping inference fast with minimal accuracy loss.
Memory/computation complexity of self‑attention w.r.t. sequence length n.
O(n²) — attention scores form an n × n matrix.
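A back-of-the-envelope check (one head, float32 scores): doubling n quadruples the score-matrix memory:

```python
# n x n float32 attention scores for a single head
for n in (1024, 2048, 4096, 8192):
    print(f"n={n:5d}: {n * n * 4 / 2**20:8.1f} MiB")  # 4.0, 16.0, 64.0, 256.0
```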
When would you prefer a Transformer over an LSTM?
When parallel training speed and modelling long‑range dependencies outweigh the quadratic memory cost.
Give one disadvantage of Transformers compared with CNNs for long sequences.
Self‑attention cost scales quadratically with sequence length, leading to high VRAM use on long inputs.