12 LLMs Flashcards
What are the three matrices produced from the input in self‑attention?
Query (Q), Key (K), and Value (V) matrices.
Write the scaled dot‑product attention formula.
Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
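A minimal NumPy sketch of the full computation (shapes and names are illustrative, not a specific library API):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) similarity matrix
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)              # row-wise softmax
    return w @ V                               # weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))           # toy inputs: seq_len=4, d_k=d_v=8
out = scaled_dot_product_attention(Q, K, V)    # shape (4, 8)
```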
What problem does positional encoding solve in Transformers?
It injects word‑order information that would otherwise be lost when processing all tokens in parallel.
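One common scheme is the fixed sinusoidal encoding from the original Transformer paper; a minimal sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Returns (seq_len, d_model) encodings added to the token embeddings.
    Assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1) token positions
    i = np.arange(d_model // 2)[None, :]       # dimension-pair index
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe
```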
Why do Transformers use multi‑head attention instead of a single head?
Multiple heads learn different relational patterns and increase representation capacity by operating in parallel sub‑spaces.
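A sketch of the head-splitting mechanics in plain NumPy (the projection matrices W* are placeholders a real model would learn):

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); each W*: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    def split(W):                              # project, then split into heads
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)  # (n_heads, seq_len, d_head)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)              # independent softmax per head
    heads = w @ V                              # each head attends in its own sub-space
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                         # mix the concatenated heads
```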
True/False: Each attention head chooses one ‘root’ word and attends only to it.
False — every head allows every token to attend to every other token with its own learned weights.
List the main components inside one encoder block.
(1) Multi‑head self‑attention, (2) add & layer‑norm, (3) position‑wise feed‑forward, (4) add & layer‑norm. (Token + positional embeddings are applied once, before the first block in the stack.)
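A minimal PyTorch sketch of one such block (hyper-parameters are illustrative defaults, not from any specific model):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)       # multi-head self-attention
        x = self.norm1(x + attn_out)           # add & layer-norm (1)
        x = self.norm2(x + self.ff(x))         # add & layer-norm (2)
        return x
```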
What two extra elements appear in a decoder block that are not in the encoder?
(1) Masked self‑attention (causal mask) and (2) cross‑attention that attends to encoder outputs.
When does the decoder’s masked self‑attention mask a position?
When the position is to the right (i.e., a future token) of the current token being generated.
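In code, the mask sets future positions' scores to −inf so they receive zero weight after the softmax; a NumPy sketch:

```python
import numpy as np

seq_len = 5
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # strictly above diagonal
scores[future] = -np.inf                   # block attention to future tokens
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)              # row i attends only to positions <= i
```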
Define auto‑regressive language modelling in one sentence.
Training a decoder‑only Transformer to predict the next token given all previous tokens in the sequence.
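The training loss is then the negative log-likelihood of each next token; a NumPy sketch (shapes are illustrative):

```python
import numpy as np

def next_token_nll(logits, token_ids):
    """logits: (seq_len, vocab_size); token_ids: (seq_len,) input tokens.
    Position t predicts token t+1, so targets are the inputs shifted left."""
    z = logits - logits.max(-1, keepdims=True)        # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(-1, keepdims=True))
    targets = token_ids[1:]                           # the last position has no target
    return -log_probs[np.arange(len(targets)), targets].mean()
```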
State one key limitation of relying only on auto‑regressive pre‑training.
It optimizes fluency, not alignment — the model may produce unsafe or unhelpful text despite grammatical correctness.
What is the goal of RLHF fine‑tuning?
To align an LLM with human preferences by using human‑rated outputs as a reward signal in reinforcement learning.
Name the two stages in RLHF.
(1) Train a reward model on ranked human preferences; (2) optimize the LLM with policy‑gradient (e.g., PPO) to maximize that reward.
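Stage 1's reward model is commonly trained with a pairwise Bradley–Terry loss over the ranked pairs; a minimal sketch:

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """r_chosen / r_rejected: reward-model scores for the human-preferred and
    rejected response in each ranked pair. Minimizing this pushes the preferred
    response's reward above the rejected one's."""
    margin = r_chosen - r_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin))).mean()  # -log sigmoid(margin)
```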
LoRA fine‑tuning freezes the base weights; what does it train instead?
Small low‑rank adapter matrices (ΔW = B·A with rank r ≪ d) added alongside selected frozen weight matrices, typically the attention projections; only A and B receive gradients.
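A one-function sketch of the LoRA forward pass (the alpha/r scaling follows the paper's convention; dimensions are illustrative):

```python
import numpy as np

def lora_forward(x, W_frozen, A, B, alpha=16, r=8):
    """x: (d_in,); W_frozen: (d_out, d_in), never updated.
    A: (r, d_in) and B: (d_out, r) are the only trained parameters, so the
    effective weight is W_frozen + (alpha / r) * B @ A, a rank-r update."""
    return W_frozen @ x + (alpha / r) * (B @ (A @ x))
```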
What two tricks does QLoRA combine?
(1) 4‑bit weight quantization of the frozen model, (2) LoRA low‑rank adapters for fine‑tuning.
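A loading recipe in the spirit of QLoRA, assuming the Hugging Face transformers + bitsandbytes API; the model id is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # trick 1: 4-bit quantized frozen base
    bnb_4bit_quant_type="nf4",               # NormalFloat4, as in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls still run in 16-bit
)
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-7b-model",                # placeholder model id
    quantization_config=bnb_config,
)
# Trick 2: attach LoRA adapters (e.g., with the peft library) and train only those.
```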
Which quantization method is CPU‑optimized for storage and loading?
GGUF (the llama.cpp format, designed for compact storage, fast loading, and CPU inference).
GPTQ’s key idea in one phrase.
Layer‑wise quantization that greedily rounds weights while compensating the error using second‑order (Hessian) information; GPU‑oriented.
What makes AWQ different from GPTQ?
It is activation‑aware: it uses activation statistics to find the small fraction of salient weight channels and protects them via per‑channel scaling instead of Hessian‑based compensation, keeping inference fast with minimal accuracy loss.
Memory/computation complexity of self‑attention w.r.t. sequence length n.
O(n²) — attention scores form an n × n matrix.
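A back-of-the-envelope check (one head, float32 scores): doubling n quadruples the score-matrix memory:

```python
# n x n float32 attention scores for a single head
for n in (1024, 2048, 4096, 8192):
    print(f"n={n:5d}: {n * n * 4 / 2**20:8.1f} MiB")  # 4.0, 16.0, 64.0, 256.0
```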
When would you prefer a Transformer over an LSTM?
When parallel training speed and modelling long‑range dependencies outweigh the quadratic memory cost.
Give one disadvantage of Transformers compared with CNNs for long sequences.
Self‑attention cost scales quadratically with sequence length, leading to high VRAM use on long inputs.