12 LLMs Flashcards

1
Q

What are the three matrices produced from the input in self‑attention?

A

Query (Q), Key (K), and Value (V) matrices.

2
Q

Write the scaled dot‑product attention formula.

A

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
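
A minimal NumPy sketch of this formula (the helper name and toy shapes are illustrative, not part of the card):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n, n) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sum of value rows

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)      # (3, 4)
```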

3
Q

What problem does positional encoding solve in Transformers?

A

It injects word‑order information that would otherwise be lost when processing all tokens in parallel.
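
One common realization is the sinusoidal encoding from the original Transformer paper; a sketch assuming an even d_model (the function name and shapes are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]             # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

# Added to the token embeddings before the first block:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```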

4
Q

Why do Transformers use multi‑head attention instead of a single head?

A

Multiple heads learn different relational patterns and increase representation capacity by operating in parallel sub‑spaces.
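
A rough NumPy sketch of those parallel sub-spaces (the projection matrices Wq/Wk/Wv/Wo and the per-head slicing loop are illustrative; real implementations typically use tensor reshapes instead of a loop):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Project X, split d_model across heads, attend in each sub-space, concatenate."""
    n, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                    # (n, d_model) each
    heads = []
    for h in range(num_heads):                          # each head has its own sub-space
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)  # per-head attention scores
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo          # back to (n, d_model)
```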

5
Q

True/False: Each attention head chooses one ‘root’ word and attends only to it.

A

False — every head allows every token to attend to every other token with its own learned weights.

6
Q

List the main components inside one encoder block.

A

(1) Token + positional embeddings (the block's input), (2) multi‑head self‑attention, (3) add & layer‑norm, (4) position‑wise feed‑forward, (5) add & layer‑norm.
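
A hedged PyTorch sketch of one post-norm encoder block built from these components (the class name, dimensions, and the ReLU choice are illustrative assumptions):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Self-attention -> add & layer-norm -> position-wise FFN -> add & layer-norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)     # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)         # residual add & layer-norm
        x = self.norm2(x + self.ffn(x))      # FFN applied per position, add & norm
        return x

# x = token_embeddings + positional_encoding; x = EncoderBlock()(x)
```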

7
Q

What two extra elements appear in a decoder block that are not in the encoder?

A

(1) Masked self‑attention (causal mask) and (2) cross‑attention that attends to encoder outputs.

8
Q

When does the decoder’s masked self‑attention mask a position?

A

When the position is to the right (i.e., a future token) of the current token being generated.
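
A small NumPy sketch of such a causal mask (the helper and shapes are illustrative): future positions receive -inf before the softmax, so their attention weight becomes zero.

```python
import numpy as np

def causal_mask(seq_len):
    """-inf above the diagonal blocks attention to future (right-hand) positions."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Added to the (seq_len, seq_len) score matrix before the softmax:
print(causal_mask(4))
# Row i keeps columns 0..i (past and current token) and masks columns i+1 onward.
```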

9
Q

Define auto‑regressive language modelling in one sentence.

A

Training a decoder‑only Transformer to predict the next token given all previous tokens in the sequence.
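
In code this objective reduces to a one-position shift between inputs and targets; a hedged PyTorch sketch (the function and tensor names are illustrative):

```python
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    """logits: (batch, seq_len, vocab); token_ids: (batch, seq_len).
    Position t is trained to predict the token at position t + 1 (teacher forcing)."""
    pred = logits[:, :-1, :]                 # predictions for positions 0 .. T-2
    target = token_ids[:, 1:]                # the "next" token at each position
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```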

10
Q

State one key limitation of relying only on auto‑regressive pre‑training.

A

It optimizes fluency, not alignment — the model may produce unsafe or unhelpful text despite grammatical correctness.

11
Q

What is the goal of RLHF fine‑tuning?

A

To align an LLM with human preferences by using human‑rated outputs as a reward signal in reinforcement learning.

12
Q

Name the two stages in RLHF.

A

(1) Train a reward model on ranked human preferences; (2) optimize the LLM with policy‑gradient (e.g., PPO) to maximize that reward.

13
Q

LoRA fine‑tuning freezes the base weights; what does it train instead?

A

Small low‑rank adapter matrices (a down‑projection A and an up‑projection B) added alongside selected frozen weight matrices; only these adapters are updated during fine‑tuning.
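
A minimal NumPy sketch of the idea (rank r, scaling alpha, and the layer sizes are illustrative assumptions): the pretrained weight W stays frozen and only the low-rank pair A, B is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 768, 768, 8, 16

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init so the adapter starts as a no-op

def lora_forward(x):
    """y = W x + (alpha / r) * B (A x); gradients flow only into A and B."""
    return W @ x + (alpha / r) * (B @ (A @ x))

print(lora_forward(rng.normal(size=(d_in,))).shape)   # (768,)
```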

14
Q

What two tricks does QLoRA combine?

A

(1) 4‑bit weight quantization of the frozen model, (2) LoRA low‑rank adapters for fine‑tuning.

15
Q

Which quantization method is CPU‑optimized for storage and loading?

A

GGUF.

16
Q

GPTQ’s key idea in one phrase.

A

Layer‑wise greedy quantization that minimizes error using Hessian information — GPU‑oriented.

17
Q

What makes AWQ different from GPTQ?

A

It is activation‑aware: it preserves the most activation‑critical weights, enabling faster inference with a slight accuracy trade‑off.

18
Q

Memory/computation complexity of self‑attention w.r.t. sequence length n.

A

O(n²) — attention scores form an n × n matrix.
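
A quick back-of-the-envelope illustration of that quadratic growth (float32 scores and the example sequence lengths are illustrative):

```python
# Memory of one n x n float32 attention-score matrix (per head, per layer)
for n in (1_024, 4_096, 16_384):
    print(f"n = {n:6d}: {n * n * 4 / 2**20:8.1f} MiB")
# 4.0 MiB, 64.0 MiB, 1024.0 MiB -> a 4x longer sequence needs 16x more memory
```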

19
Q

When would you prefer a Transformer over an LSTM?

A

When parallel training speed and modelling long‑range dependencies outweigh the quadratic memory cost.

20
Q

Give one disadvantage of Transformers compared with CNNs for long sequences.

A

Self‑attention cost scales quadratically with sequence length, leading to high VRAM use on long inputs.