LLM – Scaling, Computation Cost Flashcards

1
Q

What can you do to optimize the training of LLMs?

A

Distribute the training work across multiple GPUs:

  • Data Parallelism
  • Pipeline Parallelism
2
Q

Explain Data Parallelism in LLM training

A

Shard the data: each GPU holds a full copy of the model but processes a different slice of the dataset. At each backpropagation step, each GPU computes local gradients and sends them to a coordinator (usually a single GPU), which aggregates them into one gradient, updates the model, and then sends the updated model back to all GPUs.
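A minimal sketch of this coordinator-style update, assuming PyTorch with torch.distributed already initialized; the function and its arguments are illustrative, not from any specific codebase:

import torch
import torch.distributed as dist

def data_parallel_step(model, local_batch, optimizer, coordinator_rank=0):
    # 1. Each GPU computes gradients on its own shard of the data.
    loss = model(local_batch).mean()   # placeholder loss for illustration
    loss.backward()

    # 2. Local gradients are summed onto the coordinator rank ...
    for p in model.parameters():
        dist.reduce(p.grad, dst=coordinator_rank, op=dist.ReduceOp.SUM)

    # 3. ... which averages them and applies the optimizer update.
    if dist.get_rank() == coordinator_rank:
        for p in model.parameters():
            p.grad /= dist.get_world_size()
        optimizer.step()
    optimizer.zero_grad()

    # 4. The updated weights are broadcast back to every GPU.
    for p in model.parameters():
        dist.broadcast(p.data, src=coordinator_rank)

In practice, frameworks such as PyTorch DistributedDataParallel use an all-reduce so every GPU updates its own replica, but the flow above mirrors the coordinator description on this card.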

3
Q

Explain Pipeline Parallelism in LLM training

A

Helps with long computation time and memory: instead of splitting only the data, we split the model itself across multiple GPUs and run it as a pipeline. GPU 1 computes the first part of the model, passes its activations to GPU 2, which computes the second part, and so on. Backpropagation then runs through the stages in reverse order.
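A toy sketch of the model split, assuming PyTorch and two available CUDA devices; the layer sizes and stage boundaries are arbitrary:

import torch
import torch.nn as nn

# Stage 1 lives on GPU 0, stage 2 lives on GPU 1.
stage1 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU()).to("cuda:0")
stage2 = nn.Linear(1024, 10).to("cuda:1")

x = torch.randn(32, 512, device="cuda:0")
h = stage1(x)                # forward pass of the first part on GPU 0
y = stage2(h.to("cuda:1"))   # activations are shipped to GPU 1 for the second part

loss = y.sum()               # placeholder loss
loss.backward()              # backpropagation runs in reverse: GPU 1 first, then GPU 0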

4
Q

What is the problem with Pipeline parallelism and how to solve it?

A

GPUs are idle most of the time: while one stage works on a batch, the other stages have nothing to do.

We can split the batch into mini-batches. For example, 32 examples can be split into 4 mini-batches of 8. The first mini-batch is fed to GPU 1; when GPU 1 is done, it passes the result to GPU 2 and immediately starts on the next mini-batch. This reduces the idle time, but it is still not true parallelism.
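A rough sketch of this mini-batch scheduling (GPipe-style micro-batching), continuing from the two-stage split in the previous card's sketch. The overlap happens only because CUDA kernel launches are asynchronous; real pipeline engines schedule this explicitly:

# Split a batch of 32 examples into 4 mini-batches of 8, reusing stage1/stage2 above.
batch = torch.randn(32, 512, device="cuda:0")
outputs = []
for mb in batch.chunk(4):
    h = stage1(mb)                          # GPU 0 starts on the next mini-batch ...
    outputs.append(stage2(h.to("cuda:1")))  # ... while GPU 1 can still be busy with the previous one
loss = torch.cat(outputs).sum()
loss.backward()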

5
Q

What is quantization?

A

It is a technique to save memory when training or doing inference with LLMs, by storing values (and/or performing computation) as 4- or 8-bit integers instead of 16/32-bit floating-point numbers. It can be combined with knowledge distillation or pruning (removing excess model weights to lower the parameter count).

By definition, quantization is the process of mapping input values from a large set (often a continuous set) to output values in a (countable) smaller set, often with a finite number of elements.

6
Q

What are the parameters when doing linear quantization?

A

r - the original matrix of 32-bit floats
q - the quantized integer matrix
z - the zero-point: an integer that shifts the quantized matrix
s - the scale: the factor you multiply the shifted quantized matrix by to reconstruct the 32-bit matrix, i.e. r ≈ s × (q - z)

7
Q

How do you find the parameters S and Z of linear quantization?

A

When mapping a floating-point range [r_min, r_max] onto an integer range [q_min, q_max], we need a scale S and a shift (zero-point) Z. Requiring that r_min maps to q_min and r_max maps to q_max gives two equations (simple algebra rather than calculus), which yield:

S = (r_max - r_min) / (q_max - q_min)
Z = round(q_min - r_min / S)
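A minimal NumPy sketch of these formulas, assuming asymmetric quantization of a float32 tensor to the uint8 range 0-255; the function names are made up for illustration:

import numpy as np

def linear_quantize(r, q_min=0, q_max=255):
    # Scale maps the float range onto the integer range.
    s = (r.max() - r.min()) / (q_max - q_min)
    # Zero-point shifts the range so that r.min() maps to q_min.
    z = round(q_min - r.min() / s)
    q = np.clip(np.round(r / s + z), q_min, q_max).astype(np.uint8)
    return q, s, z

def dequantize(q, s, z):
    # Reconstruction: r ≈ s * (q - z)
    return s * (q.astype(np.float32) - z)

r = np.random.randn(4, 4).astype(np.float32)
q, s, z = linear_quantize(r)
print(np.abs(r - dequantize(q, s, z)).max())   # small reconstruction error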

8
Q

What is FLOPS?

A

Floating-point operations per second (FLOPS, flops or flop/s). Note that FLOPS is a rate (operations per second), whereas a FLOP count is a total number of operations.

Each FLOP can represent an addition, subtraction, multiplication, or division of floating-point numbers.

The total FLOP count of a model (e.g., a Transformer) provides a basic approximation of the computational cost associated with that model.

9
Q

How many FLOPs are needed to multiply an m×n matrix by a vector?

A

2mn (2 × the number of matrix entries): each of the mn entries contributes one multiplication and one addition.

10
Q

How many FLOPs are needed to multiply two matrices A ∈ R^(m×n) and B ∈ R^(n×p)?

A

2mnp: each of the mp output entries requires n multiplications and n additions.
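A quick sanity check of both counting rules, with arbitrary sizes and the convention that one multiply plus one add is 2 FLOPs:

# FLOP counting convention: one multiply + one add = 2 FLOPs.
m, n, p = 64, 128, 256

matvec_flops = 2 * m * n        # (m x n) matrix times length-n vector -> 16,384 FLOPs
matmul_flops = 2 * m * n * p    # (m x n) matrix times (n x p) matrix  -> 4,194,304 FLOPs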

11
Q

How do we calculate the forward and backward computation cost for LLM training?

A

The FLOPs for the backward pass are roughly twice those of the forward pass:

Forward: 2 × batch_size × |W|
Backward: 4 × batch_size × |W|

Training FLOPs for multiplying by a matrix W ≈ 6 × (batch size) × (size of W), as sketched below.

This does not include all computation, but it is a rough estimate. We exclude:
  • bias vector addition
  • layer normalization
  • residual connections
  • non-linearities
  • softmax
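A small numeric sketch of this estimate for a single weight matrix; the batch size and matrix shape are arbitrary:

batch_size = 32            # tokens (or examples) processed in one step
W_size = 4096 * 4096       # number of entries |W| in the weight matrix

forward_flops  = 2 * batch_size * W_size
backward_flops = 4 * batch_size * W_size          # roughly twice the forward cost
training_flops = forward_flops + backward_flops   # = 6 * batch_size * W_size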

12
Q

Consider HyperCLOVA, an 82B-parameter model that was pre-trained on 150B tokens using a cluster of 1024 A100 GPUs. The peak throughput of an A100 GPU is 312 teraFLOPS, or 3.12 × 10^14 FLOP/s. How long will the training take?

A
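A worked estimate using the 6 × (parameters) × (training tokens) rule from the previous card, assuming 100% of peak throughput (the next card explains why the real time will be longer):

params = 82e9                        # 82B parameters
tokens = 150e9                       # 150B pre-training tokens
total_flops = 6 * params * tokens    # ~7.38e22 FLOPs

cluster_peak = 1024 * 3.12e14        # 1024 A100s at peak: ~3.19e17 FLOP/s

seconds = total_flops / cluster_peak
print(seconds / 3600, "hours")       # ~64 hours, i.e. roughly 2.7 days at ideal utilization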
13
Q

Why is the training time estimate usually off compared to the real training time?

A
  • Theoretical peak throughput is not achievable with distributed training (unless your model does nothing but large matrix multiplications).
  • We ignored many additional operations such as softmax, ReLU/GeLU activations, self-attention, layer norm, etc.
  • Training divergence and restarting from earlier checkpoints are not uncommon.
  • Communication latency, memory bandwidth, caching, etc.