L9 - Systems for ML Flashcards
What are the three stages of the ML ecosystem? (slide 4)
- Model development
- Training
- Inference
What performance metric is important for training?
throughput
What performance metric is important for inference?
latency
What are the two parts of model development?
- Data part
- Model part
What needs to be done in data part?
short: data collection, cleaning and visualisation
- identify sources of data
- join data from multiple sources
- clean data
- plot trends and anomalies
What resource is the bottleneck for preprocessing data?
CPU
Why do we care about performance of preprocessing?
- affects the end-to-end training time.
- consumes significant CPU and power
What needs to be done in model part?
short: feature engineering, model design; then training and validation
- build informative features
- design new model architectures
- tune hyperparameters
- validate prediction accuracy
What are deep neural networks?
neural networks with multiple hidden layers
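A minimal PyTorch sketch (the layer sizes are arbitrary, chosen only for illustration): "deep" just means more than one hidden layer.
```python
import torch

dnn = torch.nn.Sequential(
    torch.nn.Linear(784, 128), torch.nn.ReLU(),   # hidden layer 1
    torch.nn.Linear(128, 64), torch.nn.ReLU(),    # hidden layer 2
    torch.nn.Linear(64, 10),                      # output layer
)
```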
What are the three steps of feature extraction and model search?
- feature extraction
- model search
- hyperparameter tuning
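A hypothetical sketch of the hyperparameter-tuning step as a random search; `train_and_validate` is a stand-in for a full training-plus-validation run, not a real API from the lecture.
```python
import random

def train_and_validate(lr, hidden_units):
    ...                        # build the model, train it, validate it
    return random.random()     # placeholder validation accuracy

best_acc, best_cfg = -1.0, None
for _ in range(20):
    cfg = {"lr": 10 ** random.uniform(-4, -1),
           "hidden_units": random.choice([64, 128, 256])}
    acc = train_and_validate(**cfg)
    if acc > best_acc:
        best_acc, best_cfg = acc, cfg
```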
What are the three steps of DNN training?
- forward pass: compute activations and loss
- backward pass: compute gradients
- update model weights: to minimise loss
these steps are iterated over the training dataset (see the sketch below)
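A minimal PyTorch sketch of this loop, assuming a toy linear model and synthetic data (all names here are illustrative, not from the slides):
```python
import torch

model = torch.nn.Linear(10, 1)                        # toy model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(64, 10), torch.randn(64, 1)        # synthetic batch

for epoch in range(5):                                # iterate over the dataset
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)  # forward pass: activations + loss
    loss.backward()                                   # backward pass: gradients
    opt.step()                                        # update weights to minimise loss
```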
What are the characteristics of DNN training?
- computationally intensive
- error tolerance
- large training datasets
What is meant by error tolerance?
we can trade off some accuracy for large benefits in cost and/or throughput
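One common example of exploiting this trade-off (an assumed illustration, not from the slides) is mixed-precision training: run the forward pass in float16 for throughput, accepting some numeric accuracy loss. A PyTorch sketch (needs a GPU):
```python
import torch

model = torch.nn.Linear(10, 1).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()    # rescales gradients to avoid fp16 underflow
x, y = torch.randn(64, 10).cuda(), torch.randn(64, 1).cuda()

opt.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(x), y)  # low-precision forward pass
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```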
How can large training datasets be a problem?
How is this solved?
problem: data often does not fit in memory
solution: overlap data fetching and training computation
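A PyTorch sketch of this overlap, using synthetic data (parameter values are arbitrary): with `num_workers > 0`, background CPU processes fetch and preprocess upcoming batches while the main process trains on the current one.
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(10_000, 10), torch.randn(10_000, 1))
loader = DataLoader(
    ds,
    batch_size=256,
    num_workers=4,       # background CPU processes prefetch batches
    prefetch_factor=2,   # batches each worker keeps ready in advance
    pin_memory=True,     # page-locked buffers speed up copies to the GPU
)

for x, y in loader:      # data fetching overlaps with this computation
    pass                 # forward/backward/update on the current batch
```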
How is single-node training done for DNNs?
- data preprocessing done on the CPU
- DNN training on GPUs/TPUs
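Continuing the sketch above (it reuses `loader`): preprocessing stays on the CPU inside the loader workers, while the model and the forward/backward computation run on the GPU.
```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(10, 1).to(device)   # model lives on the accelerator

for x, y in loader:                         # batches are prepared on the CPU
    # non_blocking copies can overlap with GPU compute (needs pin_memory=True)
    x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
```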
What two types of parallelism can be achieved via distributed training?
- data parallelism
- model parallelism
What is data parallelism?
partition the data and run multiple copies of the model
synchronise weight updates
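A toy NumPy sketch of the idea (assumed model: linear least squares): each replica starts from the same weights, computes a gradient on its own data shard, and all replicas apply the identical averaged update, so the model copies stay in sync.
```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=(8, 2)), rng.normal(size=8)
shards = np.array_split(np.arange(len(y)), 2)   # partition the data
w = np.zeros(2)                                 # same weights on every replica

for step in range(10):
    # each replica runs the same model on its own shard
    grads = [x[i].T @ (x[i] @ w - y[i]) / len(i) for i in shards]
    # synchronise: every replica applies the same averaged update
    w -= 0.1 * np.mean(grads, axis=0)
```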
Common ways to implement data parallelism?
- parameter server
- AllReduce (GPU workers exchange gradients directly over fast GPU-GPU interconnects)
What is model parallelism?
partition the model across multiple nodes
note: more complicated
What does each worker do when a parameter server is used?
Each worker has a copy of the full model and trains on a shard of the dataset. Each computes the forward and backward passes and, at the end, sends its gradients to the parameter server.
What is the task of the PS itself?
It aggregates the gradients received from each worker, then computes the new model weights, which are sent back to every worker.
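A toy in-process sketch of this interaction (plain method calls stand in for real RPCs; the `push`/`pull`/`step` names are illustrative, though they match common PS terminology):
```python
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr
        self.inbox = []

    def pull(self):
        return self.weights.copy()   # workers fetch the current weights

    def push(self, grad):
        self.inbox.append(grad)      # workers send their local gradients

    def step(self, num_workers):
        assert len(self.inbox) == num_workers               # synchronous round
        self.weights -= self.lr * np.mean(self.inbox, axis=0)  # aggregate + update
        self.inbox.clear()

# two workers, each with a full model copy and half of a toy dataset
rng = np.random.default_rng(0)
x, y = rng.normal(size=(100, 3)), rng.normal(size=100)
shards = [(x[:50], y[:50]), (x[50:], y[50:])]

ps = ParameterServer(dim=3)
for it in range(20):
    for xs, ys in shards:                       # each worker in turn
        w = ps.pull()
        grad = xs.T @ (xs @ w - ys) / len(ys)   # forward + backward on the shard
        ps.push(grad)
    ps.step(num_workers=len(shards))
```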
What is the idea of AllReduce?
- GPU workers exchange updates directly via AllReduce collective
- gradient aggregation is done across workers
List three steps of AllReduce.
- workers compute forward/backward
- reduce-scatter
- all-gather
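A pure-Python simulation of these steps, assuming the common ring variant of AllReduce (the ring topology is an assumption; the card only lists the phases). After reduce-scatter, worker `i` holds the fully summed chunk `(i + 1) % P`; all-gather then circulates the reduced chunks until every worker has the complete sum.
```python
import numpy as np

def ring_allreduce(grads):
    """Simulated ring AllReduce over P workers; returns the summed gradient."""
    P = len(grads)
    # each worker splits its local gradient into P chunks
    chunks = [np.array_split(g.astype(float), P) for g in grads]

    # phase 1: reduce-scatter (P - 1 ring steps)
    # (sequential in-place updates are safe: within a step, each worker
    # sends and receives different chunk indices)
    for step in range(P - 1):
        for i in range(P):
            c = (i - step) % P                 # chunk worker i passes on
            chunks[(i + 1) % P][c] = chunks[(i + 1) % P][c] + chunks[i][c]

    # phase 2: all-gather (P - 1 ring steps)
    for step in range(P - 1):
        for i in range(P):
            c = (i + 1 - step) % P             # reduced chunk to pass on
            chunks[(i + 1) % P][c] = chunks[i][c]

    return np.concatenate(chunks[0])           # every worker now has this

# three workers; the result equals the element-wise sum of their gradients
g = [np.arange(6) * (w + 1) for w in range(3)]
print(ring_allreduce(g))   # [ 0.  6. 12. 18. 24. 30.]
```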
Why is AllReduce possible? Seems like a lot of communication…
GPU-GPU interconnects are fast