L9 - Systems for ML Flashcards
What are the three stages of the ML ecosystem? (slide 4)
- Model development
- Training
- Inference
What performance metric is important for training?
throughput
What performance metric is important for inference?
latency
What are the two parts of model development?
- Data part
- Model part
What needs to be done in data part?
short: data collection, cleaning and visualisation
- identify sources of data
- join data from multiple sources
- clean data
- plot trends and anomalies
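A minimal sketch of the data part with pandas (file names, the `user_id` key, and the `date` column are all hypothetical):

```python
import pandas as pd

# Hypothetical sources: two CSV files sharing a "user_id" key.
users = pd.read_csv("users.csv")
events = pd.read_csv("events.csv")

# Join data from multiple sources.
df = events.merge(users, on="user_id", how="inner")

# Clean: drop duplicates and rows with missing values.
df = df.drop_duplicates().dropna()

# Plot trends/anomalies, e.g. daily event counts.
df.groupby("date").size().plot(title="Events per day")
```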
What resource is the bottleneck for preprocessing data?
CPU
Why do we care about performance of preprocessing?
- affects the end-to-end training time.
- consumes significant CPU and power
What needs to be done in model part?
short: feature engineering, model design; then training and validation
- build informative features
- design new model architectures
- tune hyperparameters
- validate prediction accuracy
What are deep neural networks?
neural networks with multiple hidden layers
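For example, a minimal PyTorch MLP with two hidden layers (sizes are illustrative):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # hidden layer 1
    nn.Linear(256, 128), nn.ReLU(),   # hidden layer 2
    nn.Linear(128, 10),               # output layer
)
```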
What are the three steps of feature extraction and model search?
- feature extraction
- model search
- hyperparameter tuning
What are the three steps of DNN training?
- forward pass: compute activations and loss
- backward pass: compute gradients
- update model weights: to minimise loss
these steps are iterated over the training dataset (see the loop sketch below)
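A minimal PyTorch sketch of this loop (assuming `model` from the earlier sketch and a `train_loader` yielding batches):

```python
import torch

loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, labels in train_loader:     # iterate over the training dataset
    optimizer.zero_grad()
    outputs = model(inputs)             # forward pass: activations...
    loss = loss_fn(outputs, labels)     # ...and loss
    loss.backward()                     # backward pass: gradients
    optimizer.step()                    # update weights to minimise loss
```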
What are the characteristics of DNN training?
- computationally intensive
- error tolerance
- large training datasets
What is meant by error tolerance?
we can trade off some accuracy for large benefits in cost and/or throughput
How can large training datasets be a problem?
How is this solved?
problem: Data often does not fit in memory
solution: overlap data fetching and training computation
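In PyTorch, this overlap comes from `DataLoader` worker processes; a sketch (`MyDataset` is a hypothetical Dataset implementation):

```python
from torch.utils.data import DataLoader

# Background workers fetch and preprocess the next batches on the CPU
# while the current batch is being trained on.
train_loader = DataLoader(
    MyDataset(),        # hypothetical Dataset
    batch_size=64,
    num_workers=4,      # parallel CPU workers for fetching/preprocessing
    prefetch_factor=2,  # batches each worker prepares ahead of time
    pin_memory=True,    # faster host-to-GPU copies
)
```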
How is single-node training done for DNNs?
- data preprocessing done on the CPU
- DNN training on GPUs/TPUs
What two types of parallelism can be achieved via distributed training?
- data parallelism
- model parallelism
What is data parallelism?
partition the data and run multiple copies of the model
synchronise weight updates
Common ways to implement data parallelism?
- parameter server
- AllReduce (lets GPUs exchange updates directly with each other over fast interconnects)
What is model parallelism?
partition the model across multiple nodes
note: more complicated
What does each worker do when a parameter server is used?
Each worker has a copy of the full model and trains on a shard of the dataset. Each worker computes the forward and backward passes, then sends its gradients to the parameter server.
What is the task of the PS itself?
It aggregates the gradients received from all workers, then computes the new model weights, which are sent back to each worker.
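A toy sketch of this protocol in plain Python (no real networking; weights and gradients are plain tensors without autograd):

```python
import torch

class ParameterServer:
    def __init__(self, weights):
        self.weights = weights              # authoritative model copy

    def update(self, worker_grads, lr=0.01):
        # Aggregate: average the gradients sent by all workers.
        for i, w in enumerate(self.weights):
            avg = torch.stack([g[i] for g in worker_grads]).mean(dim=0)
            w -= lr * avg                   # compute new weights
        return self.weights                 # sent back to every worker
```

Each worker would run forward/backward on its shard, call `update` with its gradient list, and overwrite its local weights with the result.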
What is the idea of AllReduce?
- GPU workers exchange updates directly via AllReduce collective
- gradient aggregation is done across workers
List three steps of AllReduce.
- workers compute forward/backward
- reduce-scatter
- all-gather
Why is AllReduce possible? Seems like a lot of communication…
GPU-GPU interconnects are fast
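With `torch.distributed`, gradient aggregation is one collective call per tensor; a sketch (assuming the process group is already initialised, e.g. with the NCCL backend):

```python
import torch.distributed as dist

# After loss.backward() on every worker:
for param in model.parameters():
    # Ring AllReduce = reduce-scatter followed by all-gather; each worker
    # ends up with the sum of all workers' gradients.
    dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
    param.grad /= dist.get_world_size()   # average before the update step
```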
Why might one need model parallelism?
Some models do not fit on one device.
What is the problem with traditional model parallelism?
low HW utilisation due to sequential dependencies
note: look at slide 33
How can we improve it?
with pipeline parallelism
note: look at slide 34
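A conceptual sketch of pipelining with micro-batches, splitting the `nn.Sequential` model from the earlier sketch across two GPUs (placement is hypothetical; this toy version still runs the stages sequentially, whereas real systems overlap them as on slide 34):

```python
import torch

stage0 = model[:2].to("cuda:0")   # first layers on GPU 0
stage1 = model[2:].to("cuda:1")   # remaining layers on GPU 1

def pipelined_forward(batch, num_microbatches=4):
    outputs = []
    for micro in batch.chunk(num_microbatches):
        # With micro-batches, stage0 can start micro-batch i+1 while
        # stage1 is still busy with micro-batch i.
        hidden = stage0(micro.to("cuda:0"))
        outputs.append(stage1(hidden.to("cuda:1")))
    return torch.cat(outputs)
```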
Can we combine it with data parallelism?
Yes
There are two approaches to distributed training.
What is synchronous training?
workers operate in lockstep to update the model weights at each iteration
What about asynchronous training?
each worker independently updates model weights
Which one offers better model quality?
synchronous
At what cost?
lower throughput
What is the main idea of PipeDream? (async pipeline parallelism)
- workers alternate between forward and backward passes
- gradients are used to update model weights immediately
note: slide 36
What is the issue?
different workers do forward/backward passes on different versions of weights
extra: weight stashing (introduced in the PipeDream paper) is a solution; see the paper for details
Is the output of model development the trained model?
No.
What can’t be done with just a trained model?
- retrain
- track data and code (for debugging)
- capture dependencies
- audit, e.g. data privacy
What is the output of model development?
training pipeline
What is done in the training step? (remember: second stage of ML ecosystem)
short: iterate through dataset, compute forward & backward pass on each batch of data to update model parameters
- train models on live data
- retrain on new data
- validate accuracy
- manage versioning
What is done in the inference step? (remember: third and last stage of ML ecosystem)
short: compute forward pass on a (small) batch of data
- serve models with real data (in real-time)
- embed model serving inside the end-user app
- optimise for low latency
- quantise model weights for memory efficiency
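A minimal sketch of post-training weight quantisation (manual, symmetric per-tensor int8; frameworks ship proper APIs for this):

```python
import torch

def quantise_int8(w: torch.Tensor):
    # Map float weights to int8 with a single scale per tensor:
    # ~4x memory saving versus float32, at some loss of precision.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantise_int8(q, scale):
    return q.to(torch.float32) * scale    # approximate original weights
```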
What are the performance metrics for training?
throughput
accuracy
And for inference?
throughput
accuracy
latency
What is a tensor?
a multi-dimensional array with elements having a uniform type
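For example:

```python
import torch

# A 2x3 tensor: two dimensions, every element shares one dtype.
t = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
print(t.shape, t.dtype)   # torch.Size([2, 3]) torch.float32
```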
What is a computation graph?
DAG where nodes are mathematical operations and edges are data (tensors) flowing between operators
What is a static computation graph?
The framework first builds the computation graph, then executes it.
And dynamic?
The framework builds the computation graph as it is being executed.
Advantages of static?
more opportunities to optimise if you have a complete view of the graph at the beginning
Advantages of dynamic?
- intuitive/familiar execution model for programmers
- can modify and inspect the internals of the graph during runtime, which is helpful for debugging
- easier support of dynamic control flow
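PyTorch illustrates the dynamic style: the graph is built as the code runs, so data-dependent control flow is just Python; `torch.jit.script` can then compile the same function into a static graph (a sketch):

```python
import torch

def f(x):
    # Dynamic control flow: which branch runs depends on runtime values.
    if x.sum() > 0:
        return x * 2
    return -x

static_f = torch.jit.script(f)   # capture a static graph once
print(static_f(torch.tensor([1.0, -0.5])))
```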
Why is debugging important for ML?
- Need to catch training problems early in a run, because runs are expensive
- Detecting problems is difficult; bugs manifest as convergence issues, not program crashes
What are the challenges for rewinding/replaying ML training?
- many sources of non-determinism
- hard to efficiently checkpoint all the states
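A minimal checkpointing sketch with `torch.save` (model, optimiser, and RNG state are only some of the states involved; data-loader position and CUDA non-determinism make exact replay harder):

```python
import torch

def save_checkpoint(path, model, optimizer, step):
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "rng": torch.get_rng_state(),   # one source of non-determinism
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    torch.set_rng_state(ckpt["rng"])
    return ckpt["step"]
```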