L9 - Systems for ML Flashcards

1
Q

What are the three stages of the ML ecosystem? (slide 4)

A
  1. Model development
  2. Training
  3. Inference
2
Q

What performance metric is important for training?

A

throughput

3
Q

What performance metric is important for inference?

A

latency

4
Q

What are the two parts of model development?

A
  1. Data part
  2. Model part
5
Q

What needs to be done in data part?

A

short: data collection, cleaning and visualisation

  1. identify sources of data
  2. join data from multiple sources
  3. clean data
  4. plot trends and anomalies
6
Q

What resource is the bottleneck for preprocessing data?

A

CPU

7
Q

Why do we care about performance of preprocessing?

A
  1. affects the end-to-end training time.
  2. consumes significant CPU and power
8
Q

What needs to be done in model part?

A

short: feature engineering, model design; then training and validation

  1. build informative features
  2. design new model architectures
  3. tune hyperparameters
  4. validate prediction accuracy
9
Q

What are deep neural networks?

A

neural networks with multiple hidden layers

10
Q

What are the three steps of feature extraction and model search?

A
  1. feature extraction
  2. model search
  3. hyperparameter tuning
11
Q

What are the three steps of DNN training?

A
  1. forward pass: compute activations and loss
  2. backward pass: compute gradients
  3. update model weights: to minimise loss

these steps are iterated over the training dataset
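The three steps can be sketched as a toy numpy training loop (the linear model, data, learning rate, and step count are illustrative assumptions):

```python
import numpy as np

# Toy training loop for a linear model pred = w * x.
rng = np.random.default_rng(0)
x = rng.normal(size=32)              # one batch of inputs
y = 3.0 * x                          # targets from the "true" weight 3.0
w, lr = 0.0, 0.1

for _ in range(100):
    pred = w * x                             # 1. forward pass: activations
    loss = np.mean((pred - y) ** 2)          #    ... and loss
    grad = np.mean(2.0 * (pred - y) * x)     # 2. backward pass: d(loss)/dw
    w -= lr * grad                           # 3. update weights to minimise loss
```

After enough iterations, `w` converges toward the true weight 3.0.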

12
Q

What are the characteristics of DNN training?

A
  1. computationally intensive
  2. error-tolerance
  3. large training datasets
13
Q

What is meant by error tolerance?

A

trade off some accuracy for large benefits in cost and/or throughput

14
Q

How can large training datasets be a problem?

How is this solved?

A

problem: Data often does not fit in memory

solution: overlap data fetching and training computation
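A minimal sketch of the overlap idea, using a bounded queue and a background producer thread to stand in for an input pipeline (the sleeps simulate I/O and compute; all names and timings are illustrative assumptions):

```python
import queue
import threading
import time

BATCHES = 8
buf = queue.Queue(maxsize=2)         # bounded prefetch buffer

def fetch():
    # producer: reads/preprocesses batches while training runs
    for i in range(BATCHES):
        time.sleep(0.01)             # simulated I/O + preprocessing
        buf.put(i)
    buf.put(None)                    # sentinel: dataset exhausted

threading.Thread(target=fetch, daemon=True).start()

trained = []
while (batch := buf.get()) is not None:
    time.sleep(0.01)                 # simulated training step on the batch
    trained.append(batch)            # fetching the next batch overlapped this
```

Because the producer fills the buffer while the consumer trains, the two 0.01 s costs largely overlap instead of adding up.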

15
Q

How is single-node training done for DNNs?

A
  1. data preprocessing done on the CPU
  2. DNN training on GPUs/TPUs
16
Q

What two types of parallelism can be achieved via distributed training?

A
  1. data parallelism
  2. model parallelism
17
Q

What is data parallelism?

A

partition the data and run multiple copies of the model

synchronise weight updates

18
Q

Common ways to implement data parallelism?

A
  1. parameter server
  2. AllReduce (allows GPUs to communicate faster with each other)
19
Q

What is model parallelism?

A

partition the model across multiple nodes

note: more complicated

20
Q

What does each worker do when a parameter server is used?

A

Each worker has a copy of the full model and trains on a shard of the dataset. Each worker computes the forward and backward passes, then sends its gradients to the parameter server.

21
Q

What is the task of the PS itself?

A

It aggregates the gradients received from each worker, then computes the new model weights, which are sent back to every worker.
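The two cards above can be sketched together on a toy linear model (`worker_grad` and `ps_step` are hypothetical names for illustration, not a real library's API):

```python
import numpy as np

def worker_grad(w, xs, ys):
    # each worker: forward + backward pass on its own shard
    return np.mean(2.0 * (w * xs - ys) * xs)

def ps_step(w, grads, lr=0.1):
    # parameter server: aggregate gradients, update and "broadcast" weights
    return w - lr * np.mean(grads)

rng = np.random.default_rng(2)
x = rng.normal(size=64)
y = 5.0 * x                              # targets from the "true" weight 5.0
xs, ys = np.split(x, 4), np.split(y, 4)  # 4 workers, 4 data shards
w = 0.0
for _ in range(100):
    grads = [worker_grad(w, a, b) for a, b in zip(xs, ys)]
    w = ps_step(w, grads)
```

Averaging the shard gradients recovers (approximately) the full-batch gradient, so `w` converges toward 5.0 as in single-node training.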

22
Q

What is the idea of AllReduce?

A
  • GPU workers exchange updates directly via AllReduce collective
  • gradient aggregation is done across workers
23
Q

List three steps of AllReduce.

A
  1. workers compute forward/backward passes
  2. reduce-scatter
  3. all-gather
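The three steps can be sketched as a ring AllReduce on plain numpy arrays (4 workers with invented gradients; real implementations such as NCCL do this on GPU buffers):

```python
import numpy as np

N = 4                                                       # number of workers
grads = [np.arange(8, dtype=float) + i for i in range(N)]   # step 1: local grads
chunks = [list(np.split(g, N)) for g in grads]              # split into N chunks each

# step 2: reduce-scatter -- after N-1 rounds, worker i holds the full
# sum of chunk (i + 1) % N; each round, worker i sends its running sum
# of chunk (i - s) % N to its ring neighbour, which accumulates it
for s in range(N - 1):
    sends = [(i, (i - s) % N, chunks[i][(i - s) % N].copy()) for i in range(N)]
    for i, c, data in sends:
        chunks[(i + 1) % N][c] += data

# step 3: all-gather -- circulate the fully summed chunks around the
# ring so every worker ends up with the complete summed gradient
for s in range(N - 1):
    sends = [(i, (i + 1 - s) % N, chunks[i][(i + 1 - s) % N].copy()) for i in range(N)]
    for i, c, data in sends:
        chunks[(i + 1) % N][c] = data
```

Each worker sends only its chunk-sized share per round, which is why the ring schedule keeps per-link traffic low regardless of the number of workers.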
24
Q

Why is AllReduce possible? Seems like a lot of communication…

A

GPU-GPU interconnects are fast

25
Q

Why might one need model parallelism?

A

Some models do not fit on one device.

26
Q

What is the problem with traditional model parallelism?

A

low HW utilisation due to sequential dependencies

note: look at slide 33

27
Q

How can we improve it?

A

with pipeline parallelism

note: look at slide 34
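A back-of-the-envelope sketch of why pipelining helps, for 2 stages and 4 microbatches with idealised unit-time forward passes (all numbers are illustrative assumptions, not from the slides):

```python
STAGES, MICROBATCHES = 2, 4

# naive model parallelism: the whole batch visits one stage at a time,
# so each stage is busy in only 1 of STAGES time slots
naive_util = 1 / STAGES                          # 0.5

# pipeline parallelism: microbatches stream through the stages; idle
# time is only the fill/drain "bubble" of STAGES - 1 slots
total_slots = STAGES + MICROBATCHES - 1          # 5 slots end-to-end
pipe_util = (STAGES * MICROBATCHES) / (STAGES * total_slots)  # 0.8
```

With more microbatches the bubble becomes a smaller fraction of the schedule, pushing utilisation toward 100%.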

28
Q

Can we combine it with data parallelism?

A

Yes

29
Q

There are two approaches to distributed training.

What is synchronous training?

A

workers operate in lockstep to update the model weights at each iteration

30
Q

What about asynchronous training?

A

each worker independently updates model weights

31
Q

Which one offers better model quality?

A

synchronous

32
Q

At what cost?

A

lower throughput

33
Q

What is the main idea of PipeDream? (async pipeline parallelism)

A
  • workers alternate between forward and backward passes
  • gradients are used to update model weights immediately

note: slide 36

34
Q

What is the issue?

A

different workers do forward/backward passes on different versions of weights

extra: weight stashing is a solution (detailed in the PipeDream paper)

35
Q

Is the output of model development the trained model?

A

No.

36
Q

What can’t be done with just a trained model?

A
  1. retrain
  2. track data and code (for debugging)
  3. capture dependencies
  4. audit, e.g. data privacy
37
Q

What is the output of model development?

A

training pipeline

38
Q

What is done in the training step? (remember: second stage of ML ecosystem)

A

short: iterate through dataset, compute forward & backward pass on each batch of data to update model parameters

  1. train models on live data
  2. retrain on new data
  3. validate accuracy
  4. manage versioning
39
Q

What is done in the inference step? (remember: third and last stage of ML ecosystem)

A

short: compute forward pass on a (small) batch of data

  1. serve models with real data (in real-time)
  2. embed model serving inside the end-user app
  3. optimise for low latency
  4. quantise model weights for memory efficiency
40
Q

What are the performance metrics for training?

A

throughput
accuracy

41
Q

And for inference?

A

throughput
accuracy
latency

42
Q

What is a tensor?

A

a multi-dimensional array with elements having a uniform type

43
Q

What is a computation graph?

A

DAG where nodes are mathematical operations and edges are data (tensors) flowing between operators
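A computation graph in miniature, assuming a hypothetical `Node` class (not any framework's real API); evaluating the root walks the DAG, with tensors flowing along the edges:

```python
import numpy as np

class Node:
    # a node is a mathematical operation; self.inputs are incoming edges
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs
    def eval(self):
        return self.op(*(n.eval() for n in self.inputs))

def const(v):
    # leaf node producing a tensor
    return Node(lambda: np.asarray(v, dtype=float))

x, y = const([1.0, 2.0]), const([3.0, 4.0])
s = Node(np.add, x, y)          # edge: tensors from x and y flow into add
z = Node(np.multiply, s, y)     # edge: tensor from add flows into multiply
out = z.eval()                  # computes (x + y) * y
```

Note that `y` feeds two operators, which is exactly why the structure is a DAG rather than a tree.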

44
Q

What is a static computation graph?

A

The framework first builds the computation graph, then executes it.

45
Q

And dynamic?

A

The framework builds the computation graph as it is being executed.

46
Q

Advantages of static?

A

more opportunities to optimise if you have a complete view of the graph at the beginning

47
Q

Advantages of dynamic?

A
  1. intuitive/familiar execution model for programmers
    - can modify and inspect the internals of the graph at runtime, which is helpful for debugging
  2. easier support for dynamic control flow
48
Q

Why is debugging important for ML?

A
  1. Need to catch training problems early in a run, because runs are expensive
  2. Detecting problems is difficult; bugs manifest as convergence issues, not program crashing
49
Q

What are the challenges for rewinding/replaying ML training?

A
  1. many sources of non-determinism
  2. hard to efficiently checkpoint all the states