SageMaker Flashcards
What is SageMaker?
- A complete ML lifecycle service comprising tools (e.g. notebooks), code, infrastructure management (e.g. EC2 instances, load balancing), algorithms, and managed services
- Three stages of SageMaker:
- Build: preprocessing, ground truth, notebooks
- Train: built-in algorithms, hyperparameter tuning, notebooks, infrastructure
- Deploy: realtime, batch, notebooks, infrastructure, neo
Control
- Controlled through the AWS console, SDK, or Jupyter notebooks
- Large list of actions within the AWS API
- Python SDKs: boto3 (the general-purpose AWS SDK, which exposes all the SageMaker API calls) and the higher-level sagemaker library, which wraps those calls in ML-focused abstractions
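Each action in the AWS API maps to a snake_case method on the boto3 client (e.g. CreateTrainingJob → create_training_job). A quick sketch of that naming convention (an illustrative helper, not part of either library):

```python
import re

def action_to_boto3_method(action: str) -> str:
    """Map an AWS API action name (e.g. "CreateTrainingJob") to the
    corresponding boto3 client method name ("create_training_job")."""
    # Insert an underscore before each interior capital, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", action).lower()

print(action_to_boto3_method("CreateTrainingJob"))      # create_training_job
print(action_to_boto3_method("ListNotebookInstances"))  # list_notebook_instances
```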
Notebooks
- Set-up: choose a name, a compute instance type, an IAM role giving the notebook permission to access S3 buckets, and networking (internet access by default; a VPC can also be configured)
- Open through Jupyter notebooks or JupyterLab
- Console communicates with AWS API via unique URL
- Within the notebook interface we can leverage SageMaker examples/algorithms
- Lifecycle configurations: “bootstrap” scripts that run when a notebook instance is created and/or started
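As a sketch, the payload for the CreateNotebookInstanceLifecycleConfig API action can be assembled as below (the config name and script are hypothetical; the script content must be base64-encoded):

```python
import base64

def lifecycle_config_params(name: str, on_start_script: str) -> dict:
    """Build the parameters for CreateNotebookInstanceLifecycleConfig.
    OnStart scripts run on every start of the notebook instance."""
    encoded = base64.b64encode(on_start_script.encode("utf-8")).decode("utf-8")
    return {
        "NotebookInstanceLifecycleConfigName": name,
        "OnStart": [{"Content": encoded}],
    }

params = lifecycle_config_params(
    "install-deps",  # hypothetical config name
    "#!/bin/bash\npip install --quiet pandas\n",
)
# Would then be passed to the API, e.g. via boto3:
# sagemaker_client.create_notebook_instance_lifecycle_config(**params)
```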
Data preprocessing in SageMaker
- Exploration through visualisation, tables, images etc
- Feature engineering: in a Jupyter notebook if the data is small; otherwise via EMR/Spark against the data warehouse
- Clean/synthesise data: handle anomalies, skew, and missing data (e.g. imputation)
- Alter data structure, converting data types, schema
- Splitting (train, test)
- Use of SM notebooks as the central area to engage with other AWS services (e.g. Quicksight, S3, EMR)
- Could also use SM algorithms such as PCA and k-means to assist with pre-processing
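A minimal, framework-free sketch of two of the steps above (mean imputation and a train/test split); in practice pandas/scikit-learn in the notebook, or EMR/Spark, would do this at scale:

```python
import random

def impute_mean(rows, col):
    """Replace missing (None) values in one column with the column mean."""
    present = [r[col] for r in rows if r[col] is not None]
    mean = sum(present) / len(present)
    return [{**r, col: mean if r[col] is None else r[col]} for r in rows]

def train_test_split(rows, test_frac=0.2, seed=42):
    """Shuffle (reproducibly) and split rows into train and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]

data = [{"age": 34}, {"age": None}, {"age": 28}, {"age": 30}, {"age": None}]
clean = impute_mean(data, "age")
train, test = train_test_split(clean)
```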
Ground Truth
“Build highly accurate training data sets using ML and reduce data labelling costs by up to 70%”
- Data is fed into an existing ML model, supported by human labellers (AWS-provided or external)
- Amazon mechanical turk: crowdsources micro-tasks to people who label data, then labels are fed back into the model which continues to learn and label
- Can define work instructions for labellers (e.g. drawing bounding boxes)
- Eventually produces labelled data set for training
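Ground Truth reads the items to label from an input manifest in S3: JSON Lines, one object per unlabelled item, each pointing at the object via "source-ref". A small sketch (bucket and keys are hypothetical):

```python
import json

def input_manifest(s3_uris):
    """Build a Ground Truth input manifest: one JSON object per line,
    each referencing an unlabelled item in S3 via "source-ref"."""
    return "\n".join(json.dumps({"source-ref": uri}) for uri in s3_uris)

manifest = input_manifest([
    "s3://my-bucket/images/cat1.jpg",  # hypothetical bucket/keys
    "s3://my-bucket/images/cat2.jpg",
])
```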
SageMaker algorithms selection
- SM built-in algorithms (as a part of SM)
- AWS Marketplace: pre-trained algorithms/models (e.g. already trained on cars), available in the same marketplace used to purchase AMIs etc.
- Custom algorithms
SageMaker built-in algorithm examples 1
- BlazingText: word2vec/text classification, NLP, sentiment analysis, named entity recognition, machine translation. Algorithm behind Comprehend
- Image Classification Algorithm: CNN, image recognition, used by Rekognition
- K-means: based on the web-scale k-means clustering algorithm; finds discrete groupings within our data
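To make the k-means idea concrete, a toy 1-D implementation of Lloyd's algorithm (the built-in algorithm is a streaming, web-scale variant, not this):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm for 1-D points: assign each point to the
    nearest centroid, then move each centroid to its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Empty clusters keep their old centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groupings, around 1 and around 10
print(kmeans([0.9, 1.0, 1.1, 9.9, 10.0, 10.1], k=2))  # → approximately [1.0, 10.0]
```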
SageMaker built-in algorithm examples 2
- LDA: text analysis, topic discovery, Comprehend
- PCA: dimensionality reduction (i.e. most influential features)
- XGBoost: eXtreme Gradient Boosting, Gradient boosted trees, good for making predictions from tabular data
- Random Cut Forest (RCF): unsupervised algorithm for detecting anomalous data points within a data set
SageMaker training: Common SageMaker architecture (built-in algorithm)
SageMaker pulls the built-in algorithm's Docker container image and the training data from S3, trains on an EC2 instance, and writes the resulting model artefact back to S3
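This architecture maps onto the CreateTrainingJob request: a container image, S3 input channels, compute resources, and an S3 output path for the model artefact. A sketch of the core fields (image URI, role ARN, bucket names and instance sizing are placeholders):

```python
def training_job_params(job_name, image_uri, role_arn, s3_train, s3_output):
    """Assemble the core fields of a CreateTrainingJob request."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # Docker image holding the algorithm
            "TrainingInputMode": "File",     # or "Pipe" to stream from S3
        },
        "RoleArn": role_arn,                 # IAM role SageMaker assumes
        "InputDataConfig": [{
            "ChannelName": "train",          # channel name the algorithm expects
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_train,
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": s3_output},  # model artefact lands here
        "ResourceConfig": {
            "InstanceType": "ml.m5.large",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

params = training_job_params(
    "demo-job", "<ecr-image-uri>", "<role-arn>",
    "s3://my-bucket/train/", "s3://my-bucket/output/",
)
# Would then be submitted via boto3: sagemaker_client.create_training_job(**params)
```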
SageMaker training: ML Docker Containers
- SageMaker built-in algorithms: images stored in a managed container registry (Amazon ECR)
- AWS deep learning containers
- AWS marketplace - provided via Docker containers
- Custom Docker containers - must adhere to a defined directory structure under /opt/ml (see notes)
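The /opt/ml contract can be sketched as a training entrypoint that reads hyperparameters and channel data from the conventional paths and writes its artefact to the model directory, which SageMaker tars and uploads to S3. Demoed here under a temporary prefix, since /opt/ml only exists inside the container:

```python
import json
import os
import tempfile

SM_PREFIX = "/opt/ml"  # SageMaker mounts config, data and output here

def run_training(prefix=SM_PREFIX):
    """Skeleton of a custom-container training entrypoint."""
    hp_path = os.path.join(prefix, "input", "config", "hyperparameters.json")
    train_dir = os.path.join(prefix, "input", "data", "train")  # "train" channel
    model_dir = os.path.join(prefix, "model")

    with open(hp_path) as f:
        hyperparameters = json.load(f)

    # ... train on files under train_dir using hyperparameters ...

    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump({"trained_with": hyperparameters}, f)
    return model_dir

# Demo under a temporary prefix
demo = tempfile.mkdtemp()
os.makedirs(os.path.join(demo, "input", "config"))
with open(os.path.join(demo, "input", "config", "hyperparameters.json"), "w") as f:
    json.dump({"epochs": "10"}, f)  # SageMaker passes hyperparameter values as strings
model_dir = run_training(prefix=demo)
```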
SageMaker training: miscellaneous concepts
- FSx for Lustre: a high-performance file system that can be linked to S3, used as a fast training data source
- Channel parameters: e.g. train channel, validation channel, testing channel. Built-in algorithms expect different channels
- Input modes: File or Pipe. For thousands of files coming out of S3 we might use Pipe mode, which streams data rather than downloading it all first
- Instance type selection: some models require GPU
- Parallelizable: the model can be split and run in parallel
- Models are stored in /opt/ml/model
SageMaker training: Managed Spot Training
- Reduces the cost of training models relative to on-demand instance pricing by using spare EC2 capacity (the spot market)
- Can define checkpoints within training (i.e. snapshots of model state at a certain time) so interrupted jobs can resume
- Pay the spot-market price for EC2 capacity (optionally capping the maximum price), accepting that instances can be reclaimed
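The checkpoint idea can be sketched as a resumable loop: reload the last saved state if one exists (e.g. after a Spot interruption), then snapshot after every epoch. This file-based toy stands in for a real framework checkpoint written to the job's checkpoint path:

```python
import json
import os
import tempfile

def train_with_checkpoints(total_epochs, ckpt_path, epoch_fn):
    """Resumable training loop: resume after the last completed epoch
    if a checkpoint exists, and checkpoint after every epoch."""
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start = json.load(f)["epoch"] + 1  # resume point
    for epoch in range(start, total_epochs):
        epoch_fn(epoch)                        # one epoch of real training
        with open(ckpt_path, "w") as f:
            json.dump({"epoch": epoch}, f)     # snapshot progress

seen = []
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
train_with_checkpoints(3, ckpt, seen.append)  # first run: epochs 0-2
train_with_checkpoints(5, ckpt, seen.append)  # "restart": resumes at epoch 3
```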
SageMaker training: Automatic hyperparameter tuning
- We define a range on our hyperparameters (e.g. epochs between 20 and 40) and then let SageMaker do the tuning to find the optimum
- SM will initiate several training jobs, cycling through to find the best combination of hyperparameters
- This is carried out by a tuning job that itself uses an ML model (Bayesian optimisation, by default) to choose promising hyperparameter combinations
- Works with built-in algorithms, custom algorithms, pre-built containers. However there are limits and we need to be mindful of EC2 resource limits
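As an illustration of the search itself, here is a random-search stand-in (SageMaker's tuner defaults to the smarter Bayesian approach): sample each hyperparameter from its range, run a trial, keep the best result. The objective function is a toy:

```python
import random

def random_search(objective, ranges, n_jobs=20, seed=0):
    """Run n_jobs trials, each sampling every hyperparameter uniformly
    from its range; keep the config with the lowest objective value."""
    rng = random.Random(seed)
    best_cfg, best_val = None, float("inf")
    for _ in range(n_jobs):
        cfg = {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        val = objective(cfg)  # in SageMaker: one training job's metric
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val

# Toy objective, minimised when lr ≈ 0.1 and epochs ≈ 30
def toy_objective(c):
    return (c["lr"] - 0.1) ** 2 + ((c["epochs"] - 30) / 10) ** 2

cfg, val = random_search(toy_objective,
                         {"lr": (0.001, 0.5), "epochs": (20, 40)})
```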
SageMaker training: SageMaker Neo
- Designed to “free a model from its framework” and allows it to make best use of the hardware on which it is deployed
- Takes a model built in a framework and compiles it into portable, framework-agnostic code
- Optimised models run up to 2x faster without loss in accuracy
- After compilation, the memory footprint can be up to 100x smaller, as the model no longer depends on its framework
- Kind of like a container for an ML model
SageMaker deploy: Inference Pipelines
- We may wish to “chain” several models together
- A type of SageMaker model that is composed of a linear sequence of two to five containers that process requests for inferences on data
- Invocations are handled as a sequence of HTTP requests: the first container in the pipeline handles the request, passes its output to the second container, and so on; the final response is returned to the client
- You can use an inference pipeline to combine preprocessing, predictions, and post-processing data science tasks
E.g. PCA to reduce dimensionality, then linear learner
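The chaining behaviour can be sketched as function composition, with hypothetical stages standing in for the containers:

```python
def make_pipeline(*stages):
    """Chain stages as an inference pipeline: each stage's output becomes
    the next stage's input; the last stage's output is the response."""
    def invoke(payload):
        for stage in stages:
            payload = stage(payload)
        return payload
    return invoke

def project(xs):
    """Crude dimensionality reduction (stand-in for a PCA container)."""
    return [sum(xs) / len(xs)]

def linear(xs):
    """Stand-in for a linear learner container."""
    return 2.0 * xs[0] + 1.0

pipeline = make_pipeline(project, linear)
print(pipeline([1.0, 2.0, 3.0]))  # → 5.0
```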