Deployment Flashcards
Batch prediction
Periodically run your model on new data and cache the results in a database
Works if the universe of inputs is relatively small (eg 1 prediction per user)
Pros:
Simple to implement
Low latency to the user
Cons:
Doesn’t scale to complex input types
Users don’t get the most up-to-date predictions
Hard to detect model staleness, and it happens frequently
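A minimal sketch of the pattern in Python, assuming a hypothetical model.predict interface and using SQLite as the cache (names are illustrative):
```python
import sqlite3

def run_batch_predictions(model, user_ids, db_path="predictions.db"):
    """Periodically score every known user and cache the results in a database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions (user_id TEXT PRIMARY KEY, score REAL)"
    )
    for user_id in user_ids:
        score = model.predict(user_id)  # hypothetical model interface
        conn.execute(
            "INSERT OR REPLACE INTO predictions (user_id, score) VALUES (?, ?)",
            (user_id, float(score)),
        )
    conn.commit()
    conn.close()

# At request time the web server only reads the cache, never calls the model:
#   SELECT score FROM predictions WHERE user_id = ?
```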
Model in service
Package up your model and include it in your deployed web server
Web server loads the model and calls it to make predictions (store the weights on the web server, or on s3 and download when needed)
Pros:
Reuses your existing infrastructure
Cons:
The web server may be written in a different language than the model
Models may change more frequently than the server code
The model eats up web server resources
Hardware not optimized for the model - no GPU
Most important - the web server and the model may need to scale differently
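A minimal sketch, assuming a Flask web server and a pickled scikit-learn-style model saved as model.pkl (the path and model interface are illustrative):
```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Weights are loaded once when the web server process starts.
# (Alternatively: download model.pkl from S3 here before loading it.)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]  # assumes a scikit-learn-style API
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run()
```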
Model as service
Most common deployment
Run your model on its own web server
The backend interacts with the model by making requests
Pros:
Dependable. Model bugs less likely to crash the web app
Scalable (pick optimal hardware)
Flexibility - easily reuse a model across multiple apps
Cons:
Adds latency
Infrastructure complexity
Now you have to run and maintain a model service (typically the ML engineer’s job)
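From the backend’s side this is just an HTTP call; a sketch assuming a hypothetical internal service address and a /predict endpoint like the one in the REST card below:
```python
import requests

MODEL_SERVICE_URL = "http://model-service:8000/predict"  # hypothetical internal address

def get_prediction(features, timeout=1.0):
    """The web app never loads the model itself - it calls the model service."""
    response = requests.post(MODEL_SERVICE_URL, json={"features": features}, timeout=timeout)
    response.raise_for_status()
    return response.json()["prediction"]
```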
REST APIs
Serving predictions in response to canonically formatted HTTP requests
Alternatives: gRPC, GraphQL
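A sketch of the model service itself, assuming FastAPI and a pickled model artifact (names and paths are illustrative):
```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("model.pkl", "rb") as f:  # illustrative artifact path
    model = pickle.load(f)

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    prediction = model.predict([req.features])[0]  # assumes a scikit-learn-style API
    return {"prediction": float(prediction)}

# Run with: uvicorn model_service:app --host 0.0.0.0 --port 8000
```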
Dependency management for model server
Model predictions depend on code, weights, and dependencies. All need to be on your web server
Hard to make consistent, hard to update
2 strategies:
- Constrain the dependencies for the model
- Use containers
GPU or no GPU
Pros:
Same hardware as in training
Usually higher throughput
Cons:
More complex
More expensive - not the norm
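In PyTorch, a common pattern is to pick the device at startup and fall back to CPU when no GPU is available (the model here is a stand-in):
```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(16, 1)  # stand-in for your real model
model = model.to(device).eval()

with torch.no_grad():
    x = torch.randn(8, 16, device=device)
    predictions = model(x)
```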
Concurrency - what is it?
What?
Multiple copies of the model running on different CPUs or cores
How?
Be careful about thread tuning - make sure each model copy is limited to the minimal number of threads it needs
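With PyTorch, for example, you can cap the threads each model copy uses so multiple copies don’t fight over the same cores:
```python
import torch

# Call these once at process startup, before running the model.
torch.set_num_threads(1)           # intra-op parallelism per model copy
torch.set_num_interop_threads(1)   # inter-op parallelism per model copy
# (Many runtimes also respect the OMP_NUM_THREADS environment variable.)
```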
What is Model distillation?
Train a smaller model to imitate your larger one
Can be finicky to do yourself; not used that often in practice
Exception - DistilBERT
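A sketch of the core idea in PyTorch: train the student on a blend of the hard-label loss and a soft loss that matches the teacher’s temperature-softened output distribution. The hyperparameters here are illustrative.
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the usual hard-label loss with a soft loss toward the teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1 - alpha) * soft

# In the training loop the teacher is frozen:
#   with torch.no_grad():
#       teacher_logits = teacher(batch)
#   loss = distillation_loss(student(batch), teacher_logits, labels)
```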
What is quantization?
Used to reduce model size and increase speed:
What?
Execute some or all of the operations in your model with a smaller numerical representation than floats (eg int8)
Some trade offs with accuracy
How?
PyTorch and TensorFlow Lite have quantization built in
Can also run quantization-aware training, which often results in higher accuracy
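For example, PyTorch’s post-training dynamic quantization stores Linear weights as int8 and quantizes activations on the fly (the model here is a stand-in):
```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
).eval()

# Weights of Linear layers are stored as int8; activations are quantized at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```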
Caching (in model deployment)
Performance optimization.
What?
For some ML models, certain inputs are much more common than others.
Instead of re-running the model on them, check the cache first
How?
Can get very fancy.
The basic way uses Python’s built-in functools.cache (or functools.lru_cache), as sketched below
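A minimal sketch with functools.lru_cache; the model call is a stand-in, and inputs must be hashable:
```python
from functools import lru_cache

def expensive_predict(features: tuple) -> float:
    return sum(features) / len(features)  # stand-in for a real model call

@lru_cache(maxsize=4096)  # functools.cache (Python 3.9+) is the unbounded variant
def cached_predict(features: tuple) -> float:
    # Arguments must be hashable, so pass a tuple rather than a list or array.
    return expensive_predict(features)

cached_predict((1.0, 2.0, 3.0))  # first call runs the model
cached_predict((1.0, 2.0, 3.0))  # repeat call is served from the cache
```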
Batching (in model deployment)
What?
ML models often achieve higher throughput when predictions are run in batches (in parallel)
How?
Collect requests until you have a batch, run prediction, return to users.
Batch size needs to be tuned. (Throughput vs latency)
Have a shortcut for when latency is too long
Probably don’t want to implement yourself
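Just to show the idea (in practice a serving framework handles this): collect requests until the batch is full or a wait limit is hit, then run one batched prediction. The model interface and limits below are hypothetical.
```python
import queue
import time

MAX_BATCH_SIZE = 32      # tune for throughput
MAX_WAIT_SECONDS = 0.01  # the latency "shortcut": never hold requests longer than this

request_queue = queue.Queue()

def predict(x):
    """Called by request handlers; blocks until the batching loop returns a result."""
    result = queue.Queue(maxsize=1)
    request_queue.put({"input": x, "result": result})
    return result.get()

def batching_loop(model):
    while True:
        batch, deadline = [], time.monotonic() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        if not batch:
            continue
        outputs = model.predict([item["input"] for item in batch])  # hypothetical batched API
        for item, output in zip(batch, outputs):
            item["result"].put(output)
```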
Sharing the gpu
What?
Your model may not take up the whole GPU, so share the GPU across multiple models
How?
Use a model servicing solution that supports this out of the box (very hard to implement)
How to do horizontal scaling?
If you have too much traffic you can split it among multiple machines
Spin up multiple copies of the service.
2 common methods:
- Container orchestration (Kubernetes)
- Serverless (AWS Lambda)
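For the serverless route, a Python AWS Lambda handler looks roughly like this (a sketch assuming the model artifact is bundled with the function and requests arrive via API Gateway):
```python
import json
import pickle

# Loaded once per Lambda container (cold start) and reused across invocations.
with open("model.pkl", "rb") as f:  # illustrative artifact path
    model = pickle.load(f)

def lambda_handler(event, context):
    features = json.loads(event["body"])["features"]
    prediction = model.predict([features])[0]  # assumes a scikit-learn-style API
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": float(prediction)}),
    }
```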
What is model deployment?
Serving is how you turn a model into something that can respond to requests; deployment is how you roll out, manage, and update these services.
How?
Roll out gradually, roll back instantly, and deploy pipelines of models
Hopefully your deployment library will take care of this for you.