L9 - Systems for ML Flashcards
What are the three stages of the ML ecosystem? (slide 4)
- Model development
- Training
- Inference
What performance metric is important for training?
throughput
What performance metric is important for inference?
latency
What are the two parts of model development?
- Data part
- Model part
What needs to be done in data part?
short: data collection, cleaning and visualisation
- identify sources of data
- join data from multiple sources
- clean data
- plot trends and anomalies
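A minimal sketch of the data part with pandas (file names, the `user_id` key, and the `date` column are all hypothetical):

```python
import pandas as pd

# Hypothetical sources: two CSV files sharing a "user_id" key.
users = pd.read_csv("users.csv")
events = pd.read_csv("events.csv")

# Join data from multiple sources.
df = events.merge(users, on="user_id", how="inner")

# Clean: drop duplicates and rows with missing values.
df = df.drop_duplicates().dropna()

# Plot trends/anomalies, e.g. daily event counts.
df.groupby("date").size().plot(title="Events per day")
```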
What resource is the bottleneck for preprocessing data?
CPU
Why do we care about performance of preprocessing?
- affects the end-to-end training time.
- consumes significant CPU and power
What needs to be done in model part?
short: feature engineering, model design; then training and validation
- build informative features
- design new model architectures
- tune hyperparameters
- validate prediction accuracy
What are deep neural networks?
neural networks with multiple hidden layers
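For example, a minimal PyTorch MLP with two hidden layers (sizes are illustrative):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # hidden layer 1
    nn.Linear(256, 128), nn.ReLU(),   # hidden layer 2
    nn.Linear(128, 10),               # output layer
)
```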
What are the three steps of feature extraction and model search?
- feature extraction
- model search
- hyperparameter tuning
What are the three steps of DNN training?
- forward pass: compute activations and loss
- backward pass: compute gradients
- update model weights: to minimise loss
these steps are iterated over the training dataset (see the loop sketch below)
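A minimal PyTorch sketch of this loop (assuming `model` from the earlier sketch and a `train_loader` yielding batches):

```python
import torch

loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, labels in train_loader:     # iterate over the training dataset
    optimizer.zero_grad()
    outputs = model(inputs)             # forward pass: activations...
    loss = loss_fn(outputs, labels)     # ...and loss
    loss.backward()                     # backward pass: gradients
    optimizer.step()                    # update weights to minimise loss
```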
What are the characteristics of DNN training?
- computationally intensive
- error tolerance
- large training datasets
What is meant by error tolerance?
we can trade off some accuracy for large benefits in cost and/or throughput
How can large training datasets be a problem?
How is this solved?
problem: Data often does not fit in memory
solution: overlap data fetching and training computation
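In PyTorch, this overlap comes from `DataLoader` worker processes; a sketch (`MyDataset` is a hypothetical Dataset implementation):

```python
from torch.utils.data import DataLoader

# Background workers fetch and preprocess the next batches on the CPU
# while the current batch is being trained on.
train_loader = DataLoader(
    MyDataset(),        # hypothetical Dataset
    batch_size=64,
    num_workers=4,      # parallel CPU workers for fetching/preprocessing
    prefetch_factor=2,  # batches each worker prepares ahead of time
    pin_memory=True,    # faster host-to-GPU copies
)
```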
How is single-node training done for DNNs?
- data preprocessing done on the CPU
- DNN training on GPUs/TPUs
What two types of parallelism can be achieved via distributed training?
- data parallelism
- model parallelism
What is data parallelism?
partition the data and run multiple copies of the model
synchronise weight updates
Common ways to implement data parallelism?
- parameter server
- AllReduce (lets GPUs exchange updates directly with each other over fast interconnects)
What is model parallelism?
partition the model across multiple nodes
note: more complicated
What does each worker do when a parameter server is used?
Each worker has a copy of the full model and trains on a shard of the dataset. Each worker computes the forward and backward passes, then sends its gradients to the parameter server.
What is the task of the PS itself?
It aggregates the gradients received from all workers, then computes the new model weights, which are sent back to each worker.
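A toy sketch of this protocol in plain Python (no real networking; weights and gradients are plain tensors without autograd):

```python
import torch

class ParameterServer:
    def __init__(self, weights):
        self.weights = weights              # authoritative model copy

    def update(self, worker_grads, lr=0.01):
        # Aggregate: average the gradients sent by all workers.
        for i, w in enumerate(self.weights):
            avg = torch.stack([g[i] for g in worker_grads]).mean(dim=0)
            w -= lr * avg                   # compute new weights
        return self.weights                 # sent back to every worker
```

Each worker would run forward/backward on its shard, call `update` with its gradient list, and overwrite its local weights with the result.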
What is the idea of AllReduce?
- GPU workers exchange updates directly via AllReduce collective
- gradient aggregation is done across workers
List three steps of AllReduce.
- workers compute forward/backward
- reduce-scatter
- all-gather
Why is AllReduce possible? Seems like a lot of communication…
GPU-GPU interconnects are fast
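With `torch.distributed`, gradient aggregation is one collective call per tensor; a sketch (assuming the process group is already initialised, e.g. with the NCCL backend):

```python
import torch.distributed as dist

# After loss.backward() on every worker:
for param in model.parameters():
    # Ring AllReduce = reduce-scatter followed by all-gather; each worker
    # ends up with the sum of all workers' gradients.
    dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
    param.grad /= dist.get_world_size()   # average before the update step
```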
Why might one need model parallelism?
Some models do not fit on one device.
What is the problem with traditional model parallelism?
low HW utilisation due to sequential dependencies
note: look at slide 33
How can we improve it?
with pipeline parallelism
note: look at slide 34
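A conceptual sketch of pipelining with micro-batches, splitting the `nn.Sequential` model from the earlier sketch across two GPUs (placement is hypothetical; this toy version still runs the stages sequentially, whereas real systems overlap them as on slide 34):

```python
import torch

stage0 = model[:2].to("cuda:0")   # first layers on GPU 0
stage1 = model[2:].to("cuda:1")   # remaining layers on GPU 1

def pipelined_forward(batch, num_microbatches=4):
    outputs = []
    for micro in batch.chunk(num_microbatches):
        # With micro-batches, stage0 can start micro-batch i+1 while
        # stage1 is still busy with micro-batch i.
        hidden = stage0(micro.to("cuda:0"))
        outputs.append(stage1(hidden.to("cuda:1")))
    return torch.cat(outputs)
```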
Can we combine it with data parallelism?
Yes
There are two approaches to distributed training.
What is synchronous training?
workers operate in lockstep to update the model weights at each iteration
What about asynchronous training?
each worker independently updates model weights
Which one offers better model quality?
synchronous
At what cost?
lower throughput
What is the main idea of PipeDream? (async pipeline parallelism)
- workers alternate between forward and backward passes
- gradients are used to update model weights immediately
note: slide 36
What is the issue?
different workers do forward/backward passes on different versions of weights
extra: weight stashing (introduced in the PipeDream paper) is a solution; see the paper for details
Is the output of model development the trained model?
No.
What can’t be done with just a trained model?
- retrain
- track data and code (for debugging)
- capture dependencies
- audit, e.g. data privacy
What is the output of model development?
training pipeline
What is done in the training step? (remember: second stage of ML ecosystem)
short: iterate through dataset, compute forward & backward pass on each batch of data to update model parameters
- train models on live data
- retrain on new data
- validate accuracy
- manage versioning
What is done in the inference step? (remember: third and last stage of ML ecosystem)
short: compute forward pass on a (small) batch of data
- serve models with real data (in real-time)
- embed model serving inside the end-user app
- optimise for low latency
- quantise model weights for memory efficiency
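A minimal sketch of post-training weight quantisation (manual, symmetric per-tensor int8; frameworks ship proper APIs for this):

```python
import torch

def quantise_int8(w: torch.Tensor):
    # Map float weights to int8 with a single scale per tensor:
    # ~4x memory saving versus float32, at some loss of precision.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantise_int8(q, scale):
    return q.to(torch.float32) * scale    # approximate original weights
```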
What are the performance metrics for training?
throughput
accuracy
And for inference?
throughput
accuracy
latency
What is a tensor?
a multi-dimensional array with elements having a uniform type
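For example:

```python
import torch

# A 2x3 tensor: two dimensions, every element shares one dtype.
t = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
print(t.shape, t.dtype)   # torch.Size([2, 3]) torch.float32
```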
What is a computation graph?
DAG where nodes are mathematical operations and edges are data (tensors) flowing between operators
What is a static computation graph?
The framework first builds the computation graph, then executes it.
And dynamic?
The framework builds the computation graph as it is being executed.
Advantages of static?
more opportunities to optimise if you have a complete view of the graph at the beginning
Advantages of dynamic?
- intuitive/familiar execution model for programmers
- can modify and inspect the internals of the graph during runtime, which is helpful for debugging
- easier support of dynamic control flow
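PyTorch illustrates the dynamic style: the graph is built as the code runs, so data-dependent control flow is just Python; `torch.jit.script` can then compile the same function into a static graph (a sketch):

```python
import torch

def f(x):
    # Dynamic control flow: which branch runs depends on runtime values.
    if x.sum() > 0:
        return x * 2
    return -x

static_f = torch.jit.script(f)   # capture a static graph once
print(static_f(torch.tensor([1.0, -0.5])))
```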
Why is debugging important for ML?
- Need to catch training problems early in a run, because runs are expensive
- Detecting problems is difficult; bugs manifest as convergence issues, not program crashes
What are the challenges for rewinding/replaying ML training?
- many sources of non-determinism
- hard to efficiently checkpoint all the states
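A minimal checkpointing sketch with `torch.save` (model, optimiser, and RNG state are only some of the states involved; data-loader position and CUDA non-determinism make exact replay harder):

```python
import torch

def save_checkpoint(path, model, optimizer, step):
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "rng": torch.get_rng_state(),   # one source of non-determinism
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    torch.set_rng_state(ckpt["rng"])
    return ckpt["step"]
```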