SageMaker Flashcards
Complete with the name of each SageMaker functionality:
-_________ is useful for more complex inference workflows
-_________ can help deployment on edge devices
-_________ can accelerate inference for Deep Learning models
-_________ evaluates new models against currently deployed models to catch errors
-Inference Pipeline
-SageMaker Neo
-Elastic Inference
-Shadow Testing
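For illustration, a minimal sketch of an Inference Pipeline with boto3: a single SageMaker model whose containers execute in sequence. The model name, image URIs, S3 paths, and role ARN are placeholders, not values from these cards.

```python
import boto3

sm = boto3.client("sagemaker")

# An Inference Pipeline is one SageMaker model with multiple containers
# run in order: here a hypothetical preprocessing step, then the predictor.
sm.create_model(
    ModelName="my-inference-pipeline",  # placeholder name
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    Containers=[
        {"Image": "<preprocessing-image-uri>",  # placeholder
         "ModelDataUrl": "s3://my-bucket/preprocess/model.tar.gz"},
        {"Image": "<xgboost-image-uri>",        # placeholder
         "ModelDataUrl": "s3://my-bucket/xgboost/model.tar.gz"},
    ],
)
```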
When training models on AWS, what is the difference between File mode and Pipe mode?
File mode copies the training files to the EBS volume of the instance, while Pipe mode streams the data from S3, reducing the volume space needed
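A minimal sketch of choosing Pipe mode with the SageMaker Python SDK, assuming a built-in algorithm image; the image URI and role are placeholders.

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<algorithm-image-uri>",  # placeholder
    role="<execution-role-arn>",        # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",  # stream from S3 instead of copying to the EBS volume
)
```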
True or False: RecordIO-Protobuf’s advantages over CSV are that it is faster, more efficient, and can be used in Pipe mode
False, CSV can also be used in Pipe mode
True or False: SageMaker Linear Learner can perform regression, binary classification, and multi-class classification
True
What training data types and training types does Linear Learner support?
Data type: CSV and RecordIO-Protobuf (float32)
Training type: Pipe and File
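A sketch of declaring those content types via the SDK's TrainingInput; the S3 URIs are placeholders.

```python
from sagemaker.inputs import TrainingInput

csv_input = TrainingInput(
    "s3://my-bucket/train.csv",                       # placeholder path
    content_type="text/csv",
)
recordio_input = TrainingInput(
    "s3://my-bucket/train.rec",                       # placeholder path
    content_type="application/x-recordio-protobuf",   # float32 data
)
```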
True or False: When training Linear Learner, it is recommended to launch multiple training jobs at the same time, since the model is very sensitive to hyperparameters
False, multiple models are trained in parallel by default and the best one is selected
True or False: Linear Learner supports both L1 and L2 regularization
True
Identify the following Linear Learner Hyperparameters by their description:
- ______ equalizes the importance given to each class in a multi-class classification model
-_______ defines the speed with which the SGD algorithm converges
-_______ governs L1 regularization
-_______ governs L2 regularization (weight decay)
-_______ keeps the precision at the specified value then maximizes recall
-_______ keeps the recall at the specified value then maximizes precision
-Balance_multiclass_weights
-Learning_rate
-L1
-Wd
-target_precision
-target_recall
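A sketch applying these hyperparameters to a Linear Learner estimator; the role is a placeholder and all values are illustrative. (target_precision and target_recall apply to binary classification, paired with binary_classifier_model_selection_criteria, so they are omitted from this multi-class example.)

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
linear_estimator = Estimator(
    image_uri=image_uris.retrieve("linear-learner", session.boto_region_name),
    role="<execution-role-arn>",          # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
linear_estimator.set_hyperparameters(
    predictor_type="multiclass_classifier",
    balance_multiclass_weights="true",    # equalize per-class importance
    learning_rate=0.1,                    # SGD convergence speed
    l1=0.001,                             # L1 regularization strength
    wd=0.01,                              # L2 regularization (weight decay)
)
```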
True or False: Linear Learner can benefit from single- and multi-machine CPUs, but not GPUs
False, it can benefit from single-machine GPUs, but not multi-machine ones
What is XGBoost?
It is an ensemble model that trains multiple decision trees based on the errors of previous trees
True or False: XGBoost accepts only CSV and RecordIO-Protobuf training data
False, it also accepts Parquet and libsvm
Identify the following XGBoost Hyperparameters by their description:
- ______ fraction of the training data sampled for each tree, prevents overfitting
-_______ step size shrinkage, helps with overfitting
-_______ minimum loss reduction to create a partition
-_______ governs L1 regularization
-_______ governs L2 regularization
-_______ metric to use in the model evaluation process
-_______ adjusts the balance between positive and negative weights (helps with unbalanced data)
-_______ max depth of the tree, lower values help with overfitting
-Subsample
-Eta
-Gamma
-Alpha
-Lambda
-eval_metric
-scale_pos_weight
-max_depth
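A sketch wiring these hyperparameters into the built-in XGBoost algorithm; the role is a placeholder, the values are illustrative rather than tuned, and num_round is added because the built-in algorithm requires it. Note that "lambda" must be unpacked from a dict since it is a Python keyword.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
xgb = Estimator(
    image_uri=image_uris.retrieve("xgboost", session.boto_region_name,
                                  version="1.5-1"),
    role="<execution-role-arn>",    # placeholder
    instance_count=1,
    instance_type="ml.m5.2xlarge",  # general-purpose M5, per the card below
)
xgb.set_hyperparameters(
    num_round=100,        # required: number of boosting rounds
    subsample=0.8,        # row sampling, prevents overfitting
    eta=0.2,              # step size shrinkage
    gamma=4,              # min loss reduction to create a partition
    alpha=0.1,            # L1 regularization
    eval_metric="auc",    # evaluation metric
    scale_pos_weight=5,   # rebalance positive/negative weights
    max_depth=5,          # lower values help with overfitting
    **{"lambda": 1.0},    # L2 regularization; "lambda" is a Python keyword
)
```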
True or False: XGBoost is compute limited, so the best type of training instance for it is a compute-focused one, such as a C-family instance
False, it is memory bound, so a general-purpose instance such as M5 is a better choice
True or False: XGBoost nowadays accepts both single-instance and distributed GPU training
True, as long as you configure the hyperparameters adequately
What are Seq2seq’s use cases?
Any case where a sequence is received as input and a sequence is produced as output (text to text, audio to text, etc.)
What model types are used to implement Seq2seq?
RNNs and CNNs with attention
What types of input data does Seq2seq accept?
Only RecordIO-Protobuf with integer tokens
What types of metrics can Seq2seq be optimized on?
Accuracy, BLEU score, Perplexity (Cross-entropy)
Which of the following are Seq2Seq hyperparameters?
-Batch_size
-Optimizer_type (adam, sgd, rmsprop)
-Learning_rate
-Alpha
-Lambda
-Num_layers_encoder
-Num_layers_decoder
-Top_k
-Top_n
-Batch_size
-Optimizer_type (adam, sgd, rmsprop)
-Learning_rate
-Num_layers_encoder
-Num_layers_decoder
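A sketch of the correct answers applied to a built-in Seq2seq estimator; the role is a placeholder, the values are illustrative, and the instance type is GPU per the next card.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
seq2seq = Estimator(
    image_uri=image_uris.retrieve("seq2seq", session.boto_region_name),
    role="<execution-role-arn>",    # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # Seq2seq trains on GPU instances only
)
seq2seq.set_hyperparameters(
    batch_size=64,
    optimizer_type="adam",          # adam, sgd, or rmsprop
    learning_rate=0.0003,
    num_layers_encoder=1,
    num_layers_decoder=1,
)
```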
True or False: Seq2seq can run on both CPU and GPU instances
False, only single or multi GPU instances
What is Amazon DeepAR useful for?
Forecasting one-dimensional (1D) time series
What XGBoost Hyperparameters do you have to configure to enable single-GPU and multi-GPU training?
-Single GPU: tree_method = gpu_hist
-Multi GPU: use_dask_gpu_training = true and TrainingInput distribution = FullyReplicated
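A sketch of both setups, reusing the xgb estimator from the earlier XGBoost sketch (assumed to be rebuilt with a GPU instance type such as ml.g4dn.xlarge; the S3 path is a placeholder).

```python
from sagemaker.inputs import TrainingInput

# Single GPU: only the tree_method hyperparameter is needed
xgb.set_hyperparameters(tree_method="gpu_hist", num_round=100)

# Multi GPU: additionally enable Dask-based training...
xgb.set_hyperparameters(
    tree_method="gpu_hist",
    use_dask_gpu_training="true",
    num_round=100,
)
# ...and fully replicate the input data on every instance
train_input = TrainingInput(
    "s3://my-bucket/train/",         # placeholder; CSV or Parquet only
    content_type="text/csv",
    distribution="FullyReplicated",
)
xgb.fit({"train": train_input})
```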
True or False: XGBoost distributed GPU training works only for RecordIO-Protobuf
False, it works only with CSV and Parquet
What Seq2seq metrics are good for measuring machine translation problems?
-BLEU score and perplexity