SageMaker Flashcards

1
Q

What is SageMaker?

A
  • Complete ML lifecycle service. Comprises tools (e.g. notebooks), code, infrastructure management (e.g. EC2 instances, load balancing), algorithms, and managed services
  • Three stages of SageMaker:
    1. Build: preprocessing, ground truth, notebooks
    2. Train: built-in algorithms, hyperparameter tuning, notebooks, infrastructure
    3. Deploy: real-time, batch, notebooks, infrastructure, Neo
2
Q

Control

A
  • Controlled through the AWS console, SDK, or Jupyter notebooks
  • Large list of actions within the AWS API
  • Python SDK: uses the boto3 and sagemaker libraries; sagemaker is the library which contains all the API calls for SageMaker
3
Q

Notebooks

A
  • Set-up: set the name, compute instance type, IAM role (to give the notebook permission to access S3 buckets), and network (internet access by default; can also be placed in a VPC)
  • Open through Jupyter notebooks or JupyterLab
  • Console communicates with the AWS API via a unique presigned URL
  • Within the notebook interface we can leverage SageMaker examples/algorithms
  • Lifecycle configurations: “bootstrap” scripts can be used on the notebooks
4
Q

Data preprocessing in SageMaker

A
  • Exploration through visualisation, tables, images, etc.
  • Feature engineering: small datasets can be handled in the Jupyter notebook; larger datasets may call for EMR/Spark against a data warehouse
  • Clean the data: handle anomalies, skew, and missing data (imputation)
  • Alter data structure, converting data types, schema
  • Splitting (train, test)
  • Use of SM notebooks as the central area to engage with other AWS services (e.g. QuickSight, S3, EMR)
  • Could also use SM algorithms such as PCA and k-means to assist with pre-processing
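A minimal sketch of the splitting step above (pure Python; in a notebook the two resulting parts would typically be written to separate S3 prefixes for the train and test channels):

```python
import random

# Toy train/test split; `data` stands in for preprocessed records.
random.seed(42)                # reproducible shuffle
data = list(range(100))
random.shuffle(data)

split = int(len(data) * 0.8)   # 80/20 train/test split
train, test = data[:split], data[split:]
```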
5
Q

Ground Truth

A

“Build highly accurate training data sets using ML and reduce data labelling costs by up to 70%”

  • Data is fed into an existing ML model, which is supported by people (AWS or external)
  • Amazon Mechanical Turk: crowdsources micro-tasks to people who label data, then labels are fed back into the model which continues to learn and label
  • Can define work instructions (e.g. drawing a bounding box)
  • Eventually produces labelled data set for training
6
Q

SageMaker algorithms selection

A
  1. SM built-in algorithms (as a part of SM)
  2. AWS Marketplace: pre-trained algorithms/models (e.g. already trained on images of cars), available through the same marketplace used to purchase AMIs
  3. Custom algorithms
7
Q

SageMaker built-in algorithm examples 1

A
  • BlazingText: word2vec/text classification, NLP, sentiment analysis, named entity recognition, machine translation. Algorithm behind Comprehend
  • Image Classification Algorithm: CNN, image recognition, used by Rekognition
  • K-means: based on the web-scale k-means clustering algorithm; finds discrete groupings within our data
8
Q

SageMaker built-in algorithm examples 2

A
  • LDA (Latent Dirichlet Allocation): text analysis, topic discovery; used by Comprehend
  • PCA (Principal Component Analysis): dimensionality reduction (i.e. finds the most influential features)
  • XGBoost: eXtreme Gradient Boosting, Gradient boosted trees, good for making predictions from tabular data
  • Random Cut Forest (RCF): unsupervised algorithm for detecting anomalous data points within a data set
9
Q

SageMaker training: Common SageMaker architecture (built-in algorithm)

A

SageMaker pulls the built-in algorithm's Docker container (from ECR) and the training data from S3, trains the model on EC2 instances, and writes the resulting model artefact back to S3
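The flow above maps onto the request payload a CreateTrainingJob API call takes. A sketch: the names, ARNs, and bucket paths here are placeholders, not real resources.

```python
# Shape of a CreateTrainingJob request; with credentials configured it
# would be sent via boto3.client("sagemaker").create_training_job(**training_request).
training_request = {
    "TrainingJobName": "demo-training-job",          # hypothetical name
    "AlgorithmSpecification": {
        "TrainingImage": "<built-in-algorithm-ecr-image-uri>",  # Docker container
        "TrainingInputMode": "File",
    },
    "RoleArn": "<execution-role-arn>",
    "InputDataConfig": [{                            # training data from S3
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/train/",
        }},
    }],
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/output/"},  # model back to S3
    "ResourceConfig": {                              # EC2 instances used for training
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}
```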

10
Q

SageMaker training: ML Docker Containers

A
  • SageMaker built-in algorithms: pulled from a managed container registry (Amazon ECR)
  • AWS deep learning containers
  • AWS marketplace - provided via Docker containers
  • Custom Docker containers: must adhere to a certain structure (see notes)
  • Containers use the /opt/ml directory layout (e.g. /opt/ml/input, /opt/ml/model, /opt/ml/output)
11
Q

SageMaker training: miscellaneous concepts

A
  • FSx for Lustre: high-performance file system that can be linked to S3 for fast training-data access
  • Channel parameters: e.g. train channel, validation channel, testing channel. Built-in algorithms expect different channels
  • Input modes: File or Pipe. For thousands of files coming out of S3 we might use Pipe mode
  • Instance type selection: some models require GPU
  • Parallelizable: the model can be split and run in parallel
  • Models are stored in /opt/ml/model
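The channel and input-mode ideas above can be sketched as the InputDataConfig section of a training job request (bucket paths are placeholders):

```python
# Each channel maps a name the algorithm expects to an S3 location;
# "Pipe" streams records to the container instead of copying files first.
input_data_config = [
    {
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix", "S3Uri": "s3://my-bucket/train/",
        }},
        "InputMode": "Pipe",
    },
    {
        "ChannelName": "validation",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix", "S3Uri": "s3://my-bucket/validation/",
        }},
        "InputMode": "Pipe",
    },
]
channel_names = [c["ChannelName"] for c in input_data_config]
```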
12
Q

SageMaker training: Managed Spot Training

A
  • Optimises the cost of training models relative to on-demand instances
  • Can define checkpoints within training (i.e. snapshots of model state at points in time) so interrupted jobs can resume
  • Training runs on spare EC2 capacity sold on the spot market (historically acquired by bidding)
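In the sagemaker Python library, managed spot training is enabled through a handful of Estimator arguments. A sketch, collected as a dict so it runs without AWS access (the bucket path is a placeholder):

```python
# Spot-related arguments accepted by sagemaker.estimator.Estimator.
spot_settings = {
    "use_spot_instances": True,   # train on spare EC2 spot capacity
    "max_run": 3600,              # cap on actual training seconds
    "max_wait": 7200,             # total wait incl. interruptions; must be >= max_run
    "checkpoint_s3_uri": "s3://my-bucket/checkpoints/",  # resume point after interruption
}
# These would be passed as sagemaker.estimator.Estimator(..., **spot_settings)
```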
13
Q

SageMaker training: Automatic hyperparameter tuning

A
  • We define a range on our hyperparameters (e.g. epochs between 20 and 40) and then let SageMaker do the tuning to find the optimum
  • SM will initiate several training jobs, cycling through to find the best combination of hyperparameters
  • This is carried out by a “tuning model” which is an ML model used for optimising hyperparameters
  • Works with built-in algorithms, custom algorithms, pre-built containers. However there are limits and we need to be mindful of EC2 resource limits
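A toy illustration of what the tuning job automates: sample hyperparameter combinations from declared ranges and keep the best result. The objective here is a made-up stand-in for a training job's validation metric:

```python
import random

def objective(epochs, eta):
    # Hypothetical validation score; a real tuning job would launch a
    # training job per combination and read back its metric.
    return -abs(eta - 0.1) - abs(epochs - 32) / 100

ranges = {"epochs": (20, 40), "eta": (0.01, 0.3)}  # e.g. epochs between 20 and 40
random.seed(0)

trials = []
for _ in range(10):                                # one "training job" per trial
    params = {
        "epochs": random.randint(*ranges["epochs"]),
        "eta": random.uniform(*ranges["eta"]),
    }
    trials.append((objective(**params), params))

best_score, best_params = max(trials, key=lambda t: t[0])
```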
14
Q

SageMaker training: SageMaker Neo

A
  • Designed to “free a model from its framework” and allows it to make best use of the hardware on which it is deployed
  • Takes a model built on a framework and converts it into portable code (i.e. compiles it into a framework-agnostic ML model)
  • Optimised models can run up to 2x faster without loss in accuracy
  • After compilation, the memory footprint can shrink by up to 100x since the model no longer depends on its framework
  • Acts somewhat like a container for the ML model
15
Q

SageMaker deploy: Inference Pipelines

A
  • We may wish to “chain” several models together
  • A type of SageMaker model that is composed of a linear sequence of two to five containers that process requests for inferences on data
  • Invocations are handled as a sequence of HTTP requests. The first container in the pipeline handles the request then passes onto the second container etc. The final response is provided to the client
  • You can use an inference pipeline to combine preprocessing, predictions, and post-processing data science tasks
    E.g. PCA to reduce dimensionality, then linear learner
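The container sequence can be mimicked in a few lines: each stage's output becomes the next stage's input, mirroring the chain of HTTP requests between containers (the stages here are toy stand-ins):

```python
def preprocess(body):        # stand-in for e.g. a PCA container
    return [x / 10 for x in body]

def predict(body):           # stand-in for e.g. a Linear Learner container
    return sum(body)

pipeline = [preprocess, predict]   # two-container inference pipeline

payload = [1.0, 2.0, 3.0]          # client request
for container in pipeline:         # each response feeds the next container
    payload = container(payload)
# `payload` is now the final response returned to the client
```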
16
Q

SageMaker deploy: Real-time inference

A
  • Data in S3 and a containerised model from ECR feed into the deployed model, which is accessed through a SageMaker endpoint (not publicly accessible)
  • To access the SageMaker endpoint, we use an API call, InvokeEndpoint, from servers, a mobile app, or Lambda
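A sketch of the client side. The endpoint name is hypothetical, and the actual call needs AWS credentials, so it is shown as a comment:

```python
features = [5.1, 3.5, 1.4, 0.2]
payload = ",".join(str(f) for f in features)  # CSV body many built-ins accept

# With credentials configured, InvokeEndpoint goes through the
# sagemaker-runtime client:
#   import boto3
#   rt = boto3.client("sagemaker-runtime")
#   resp = rt.invoke_endpoint(EndpointName="my-endpoint",
#                             ContentType="text/csv", Body=payload)
#   prediction = resp["Body"].read()
```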
17
Q

SageMaker deploy: Batch inference

A
  • Model is stored in S3/docker container. Inference data is fed in from an S3 bucket and results are stored in another S3 location
  • Amazon SM Batch Transform is used to get predictions for an entire data set in one job. By contrast, SM endpoints (part of Amazon SM hosting services) set up a persistent endpoint to serve one prediction at a time, which is not optimal for processing large data sets
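The batch flow maps onto a CreateTransformJob request. A sketch with placeholder names and buckets:

```python
# boto3.client("sagemaker").create_transform_job(**transform_request) would submit this.
transform_request = {
    "TransformJobName": "demo-batch-job",            # hypothetical name
    "ModelName": "demo-model",                       # model already registered from S3
    "TransformInput": {                              # inference data fed from S3
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/batch-input/",
        }},
        "ContentType": "text/csv",
    },
    "TransformOutput": {"S3OutputPath": "s3://my-bucket/batch-output/"},  # results location
    "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
}
```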
18
Q

SageMaker deploy: Elastic Inference

A
  • A resource that can be attached to EC2 CPU instances to accelerate deep learning inference workloads
  • Allows you to attach GPU-powered acceleration to many Amazon machine instances in order to reduce the cost of running deep learning inference (up to 75%)
  • Supports TensorFlow, MXNet, ONNX
19
Q

SageMaker deploy: Accessing inference from apps

A
  • AWS API/SDK required
  • To call inference from a simple website/app, we require the API Gateway to grant access
  • This links to a Lambda function which accesses our SageMaker Endpoint
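A hedged sketch of the Lambda in that chain. The endpoint name and event shape are assumptions (API Gateway proxy integration), and the SageMaker call is stubbed out so the sketch runs without AWS access:

```python
import json

ENDPOINT_NAME = "my-endpoint"  # hypothetical

def handler(event, context):
    payload = event["body"]    # CSV features forwarded by API Gateway
    # Real version, given boto3 and credentials in the Lambda runtime:
    #   rt = boto3.client("sagemaker-runtime")
    #   resp = rt.invoke_endpoint(EndpointName=ENDPOINT_NAME,
    #                             ContentType="text/csv", Body=payload)
    #   prediction = resp["Body"].read().decode()
    prediction = "0.87"        # stubbed response
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```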
20
Q

What are the different ways that Docker containers are used by SageMaker?

A
  • Use a built-in algorithm: the existing SageMaker algorithms are packaged in Docker images behind the scenes
  • Using pre-built container images that support deep learning frameworks such as TensorFlow, PyTorch, MXNet etc. with their dependencies
  • Extending a pre-built container image by making your own adjustments to the image to satisfy your needs
  • Bringing your own custom container image if none of the existing SageMaker options fulfils your requirements (e.g. R models)
21
Q

What are Docker containers?

A
  • Docker is a program that performs operating-system-level virtualization for installing, distributing, and managing software. It packages applications and their dependencies into virtual containers that provide isolation, portability, and security.
  • You can put scripts, algorithms, and inference code for your machine learning models into containers. The container includes the runtime, system tools, system libraries, and other code required to train your algorithms or deploy your models.
22
Q

Deploying your Docker container

A
  • Install sagemaker-containers within the Dockerfile
  • Define the entry point containing the code to run when the container starts
  • Push the image to the Elastic Container Registry (ECR)
  • SageMaker pulls the image from ECR and runs the container