SageMaker Flashcards
What is SageMaker?
- A complete ML lifecycle service comprising tools (e.g. notebooks), code, infrastructure management (e.g. EC2 instances, load balancing), algorithms, and managed services
- Three stages of SageMaker:
- Build: preprocessing, ground truth, notebooks
- Train: built-in algorithms, hyperparameter tuning, notebooks, infrastructure
- Deploy: realtime, batch, notebooks, infrastructure, neo
Control
- Controlled through the AWS console, SDK, or Jupyter notebooks
- Large list of actions within the AWS API
- Python SDKs: boto3 (the general-purpose AWS SDK, which exposes all the SageMaker API calls) and the higher-level sagemaker library, which wraps those calls in ML-focused abstractions
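Each action in the AWS API maps to a snake_case method on the boto3 client (e.g. CreateTrainingJob → create_training_job). A quick sketch of that naming convention (an illustrative helper, not part of either library):

```python
import re

def action_to_boto3_method(action: str) -> str:
    """Map an AWS API action name (e.g. "CreateTrainingJob") to the
    corresponding boto3 client method name ("create_training_job")."""
    # Insert an underscore before each interior capital, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", action).lower()

print(action_to_boto3_method("CreateTrainingJob"))      # create_training_job
print(action_to_boto3_method("ListNotebookInstances"))  # list_notebook_instances
```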
Notebooks
- Set-up: choose a name, a compute instance type, an IAM role giving the notebook permission to access S3 buckets, and networking (internet access by default; a VPC can also be configured)
- Open through Jupyter notebooks or JupyterLab
- Console communicates with AWS API via unique URL
- Within the notebook interface we can leverage SageMaker examples/algorithms
- Lifecycle configurations: “bootstrap” scripts that run when a notebook instance is created and/or started
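As a sketch, the payload for the CreateNotebookInstanceLifecycleConfig API action can be assembled as below (the config name and script are hypothetical; the script content must be base64-encoded):

```python
import base64

def lifecycle_config_params(name: str, on_start_script: str) -> dict:
    """Build the parameters for CreateNotebookInstanceLifecycleConfig.
    OnStart scripts run on every start of the notebook instance."""
    encoded = base64.b64encode(on_start_script.encode("utf-8")).decode("utf-8")
    return {
        "NotebookInstanceLifecycleConfigName": name,
        "OnStart": [{"Content": encoded}],
    }

params = lifecycle_config_params(
    "install-deps",  # hypothetical config name
    "#!/bin/bash\npip install --quiet pandas\n",
)
# Would then be passed to the API, e.g. via boto3:
# sagemaker_client.create_notebook_instance_lifecycle_config(**params)
```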
Data preprocessing in SageMaker
- Exploration through visualisation, tables, images etc
- Feature engineering: in a Jupyter notebook if the data is small; otherwise via EMR/Spark against the data warehouse
- Clean/synthesise data: handle anomalies, skew, and missing data (e.g. imputation)
- Alter data structure, converting data types, schema
- Splitting (train, test)
- Use of SM notebooks as the central area to engage with other AWS services (e.g. Quicksight, S3, EMR)
- Could also use SM algorithms such as PCA and k-means to assist with pre-processing
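A minimal, framework-free sketch of two of the steps above (mean imputation and a train/test split); in practice pandas/scikit-learn in the notebook, or EMR/Spark, would do this at scale:

```python
import random

def impute_mean(rows, col):
    """Replace missing (None) values in one column with the column mean."""
    present = [r[col] for r in rows if r[col] is not None]
    mean = sum(present) / len(present)
    return [{**r, col: mean if r[col] is None else r[col]} for r in rows]

def train_test_split(rows, test_frac=0.2, seed=42):
    """Shuffle (reproducibly) and split rows into train and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]

data = [{"age": 34}, {"age": None}, {"age": 28}, {"age": 30}, {"age": None}]
clean = impute_mean(data, "age")
train, test = train_test_split(clean)
```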
Ground Truth
“Build highly accurate training data sets using ML and reduce data labelling costs by up to 70%”
- Data is fed into an existing ML model, supported by human labellers (AWS-provided or external)
- Amazon mechanical turk: crowdsources micro-tasks to people who label data, then labels are fed back into the model which continues to learn and label
- Can define work instructions for labellers (e.g. drawing bounding boxes)
- Eventually produces labelled data set for training
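Ground Truth reads the items to label from an input manifest in S3: JSON Lines, one object per unlabelled item, each pointing at the object via "source-ref". A small sketch (bucket and keys are hypothetical):

```python
import json

def input_manifest(s3_uris):
    """Build a Ground Truth input manifest: one JSON object per line,
    each referencing an unlabelled item in S3 via "source-ref"."""
    return "\n".join(json.dumps({"source-ref": uri}) for uri in s3_uris)

manifest = input_manifest([
    "s3://my-bucket/images/cat1.jpg",  # hypothetical bucket/keys
    "s3://my-bucket/images/cat2.jpg",
])
```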
SageMaker algorithms selection
- SM built-in algorithms (as a part of SM)
- AWS Marketplace: pre-trained algorithms/models (e.g. already trained on cars), available in the same marketplace used to purchase AMIs etc.
- Custom algorithms
SageMaker built-in algorithm examples 1
- BlazingText: word2vec/text classification, NLP, sentiment analysis, named entity recognition, machine translation. Algorithm behind Comprehend
- Image Classification Algorithm: CNN, image recognition, used by Rekognition
- K-means: based on the web-scale k-means clustering algorithm; finds discrete groupings within our data
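To make the k-means idea concrete, a toy 1-D implementation of Lloyd's algorithm (the built-in algorithm is a streaming, web-scale variant, not this):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm for 1-D points: assign each point to the
    nearest centroid, then move each centroid to its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Empty clusters keep their old centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groupings, around 1 and around 10
print(kmeans([0.9, 1.0, 1.1, 9.9, 10.0, 10.1], k=2))  # → approximately [1.0, 10.0]
```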
SageMaker built-in algorithm examples 2
- LDA: text analysis, topic discovery, Comprehend
- PCA: dimensionality reduction (i.e. most influential features)
- XGBoost: eXtreme Gradient Boosting, Gradient boosted trees, good for making predictions from tabular data
- Random Cut Forest (RCF): unsupervised algorithm for detecting anomalous data points within a data set
SageMaker training: Common SageMaker architecture (built-in algorithm)
SageMaker pulls the built-in algorithm's Docker container image and the training data from S3, trains on an EC2 instance, and writes the resulting model artefact back to S3
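This architecture maps onto the CreateTrainingJob request: a container image, S3 input channels, compute resources, and an S3 output path for the model artefact. A sketch of the core fields (image URI, role ARN, bucket names and instance sizing are placeholders):

```python
def training_job_params(job_name, image_uri, role_arn, s3_train, s3_output):
    """Assemble the core fields of a CreateTrainingJob request."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # Docker image holding the algorithm
            "TrainingInputMode": "File",     # or "Pipe" to stream from S3
        },
        "RoleArn": role_arn,                 # IAM role SageMaker assumes
        "InputDataConfig": [{
            "ChannelName": "train",          # channel name the algorithm expects
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_train,
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": s3_output},  # model artefact lands here
        "ResourceConfig": {
            "InstanceType": "ml.m5.large",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

params = training_job_params(
    "demo-job", "<ecr-image-uri>", "<role-arn>",
    "s3://my-bucket/train/", "s3://my-bucket/output/",
)
# Would then be submitted via boto3: sagemaker_client.create_training_job(**params)
```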
SageMaker training: ML Docker Containers
- SageMaker built-in algorithms: images stored in a managed container registry (Amazon ECR)
- AWS deep learning containers
- AWS marketplace - provided via Docker containers
- Custom Docker containers - must adhere to a defined directory structure under /opt/ml (see notes)
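The /opt/ml contract can be sketched as a training entrypoint that reads hyperparameters and channel data from the conventional paths and writes its artefact to the model directory, which SageMaker tars and uploads to S3. Demoed here under a temporary prefix, since /opt/ml only exists inside the container:

```python
import json
import os
import tempfile

SM_PREFIX = "/opt/ml"  # SageMaker mounts config, data and output here

def run_training(prefix=SM_PREFIX):
    """Skeleton of a custom-container training entrypoint."""
    hp_path = os.path.join(prefix, "input", "config", "hyperparameters.json")
    train_dir = os.path.join(prefix, "input", "data", "train")  # "train" channel
    model_dir = os.path.join(prefix, "model")

    with open(hp_path) as f:
        hyperparameters = json.load(f)

    # ... train on files under train_dir using hyperparameters ...

    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump({"trained_with": hyperparameters}, f)
    return model_dir

# Demo under a temporary prefix
demo = tempfile.mkdtemp()
os.makedirs(os.path.join(demo, "input", "config"))
with open(os.path.join(demo, "input", "config", "hyperparameters.json"), "w") as f:
    json.dump({"epochs": "10"}, f)  # SageMaker passes hyperparameter values as strings
model_dir = run_training(prefix=demo)
```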
SageMaker training: miscellaneous concepts
- FSx for Lustre: a high-performance file system that can be linked to S3, used as a fast training data source
- Channel parameters: e.g. train channel, validation channel, testing channel. Built-in algorithms expect different channels
- Input modes: File or Pipe. For thousands of files coming out of S3 we might use Pipe mode, which streams data rather than downloading it all first
- Instance type selection: some models require GPU
- Parallelizable: the model can be split and run in parallel
- Models are stored in /opt/ml/model
SageMaker training: Managed Spot Training
- Reduces the cost of training models relative to on-demand instance pricing by using spare EC2 capacity (the spot market)
- Can define checkpoints within training (i.e. snapshots of model state at a certain time) so interrupted jobs can resume
- Pay the spot-market price for EC2 capacity (optionally capping the maximum price), accepting that instances can be reclaimed
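The checkpoint idea can be sketched as a resumable loop: reload the last saved state if one exists (e.g. after a Spot interruption), then snapshot after every epoch. This file-based toy stands in for a real framework checkpoint written to the job's checkpoint path:

```python
import json
import os
import tempfile

def train_with_checkpoints(total_epochs, ckpt_path, epoch_fn):
    """Resumable training loop: resume after the last completed epoch
    if a checkpoint exists, and checkpoint after every epoch."""
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start = json.load(f)["epoch"] + 1  # resume point
    for epoch in range(start, total_epochs):
        epoch_fn(epoch)                        # one epoch of real training
        with open(ckpt_path, "w") as f:
            json.dump({"epoch": epoch}, f)     # snapshot progress

seen = []
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
train_with_checkpoints(3, ckpt, seen.append)  # first run: epochs 0-2
train_with_checkpoints(5, ckpt, seen.append)  # "restart": resumes at epoch 3
```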
SageMaker training: Automatic hyperparameter tuning
- We define a range on our hyperparameters (e.g. epochs between 20 and 40) and then let SageMaker do the tuning to find the optimum
- SM will initiate several training jobs, cycling through to find the best combination of hyperparameters
- This is carried out by a tuning job that itself uses an ML model (Bayesian optimisation, by default) to choose promising hyperparameter combinations
- Works with built-in algorithms, custom algorithms, pre-built containers. However there are limits and we need to be mindful of EC2 resource limits
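As an illustration of the search itself, here is a random-search stand-in (SageMaker's tuner defaults to the smarter Bayesian approach): sample each hyperparameter from its range, run a trial, keep the best result. The objective function is a toy:

```python
import random

def random_search(objective, ranges, n_jobs=20, seed=0):
    """Run n_jobs trials, each sampling every hyperparameter uniformly
    from its range; keep the config with the lowest objective value."""
    rng = random.Random(seed)
    best_cfg, best_val = None, float("inf")
    for _ in range(n_jobs):
        cfg = {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        val = objective(cfg)  # in SageMaker: one training job's metric
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val

# Toy objective, minimised when lr ≈ 0.1 and epochs ≈ 30
def toy_objective(c):
    return (c["lr"] - 0.1) ** 2 + ((c["epochs"] - 30) / 10) ** 2

cfg, val = random_search(toy_objective,
                         {"lr": (0.001, 0.5), "epochs": (20, 40)})
```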
SageMaker training: SageMaker Neo
- Designed to “free a model from its framework” and allows it to make best use of the hardware on which it is deployed
- Takes a model built in a framework and compiles it into portable, framework-agnostic code
- Optimised models run up to 2x faster without loss in accuracy
- After compilation, the memory footprint can be up to 100x smaller, as the model no longer depends on its framework
- Kind of like a container for an ML model
SageMaker deploy: Inference Pipelines
- We may wish to “chain” several models together
- A type of SageMaker model that is composed of a linear sequence of two to five containers that process requests for inferences on data
- Invocations are handled as a sequence of HTTP requests: the first container in the pipeline handles the request, passes its output to the second container, and so on; the final response is returned to the client
- You can use an inference pipeline to combine preprocessing, predictions, and post-processing data science tasks
E.g. PCA to reduce dimensionality, then linear learner
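The chaining behaviour can be sketched as function composition, with hypothetical stages standing in for the containers:

```python
def make_pipeline(*stages):
    """Chain stages as an inference pipeline: each stage's output becomes
    the next stage's input; the last stage's output is the response."""
    def invoke(payload):
        for stage in stages:
            payload = stage(payload)
        return payload
    return invoke

def project(xs):
    """Crude dimensionality reduction (stand-in for a PCA container)."""
    return [sum(xs) / len(xs)]

def linear(xs):
    """Stand-in for a linear learner container."""
    return 2.0 * xs[0] + 1.0

pipeline = make_pipeline(project, linear)
print(pipeline([1.0, 2.0, 3.0]))  # → 5.0
```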