SageMaker Built-in Algorithms Flashcards
Linear Learner
linear regression
can handle both regression and classification
for classification, a linear threshold function is used (binary or multiclass)
Linear Learner Input Format
recordIO/protobuf or CSV
file or pipe mode supported
Linear Learner Usage
preprocessing: data must be normalized and shuffled
training: choose an optimization algorithm (SGD, Adam, etc.); multiple models are optimized in parallel; tune L1 and L2 regularization
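A minimal sketch of wiring these up with the SageMaker Python SDK (the role ARN and S3 paths are placeholders):

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder role ARN

# Retrieve the built-in Linear Learner container for the current region
container = image_uris.retrieve("linear-learner", session.boto_region_name)

ll = Estimator(container, role, instance_count=1, instance_type="ml.m5.xlarge",
               sagemaker_session=session)
ll.set_hyperparameters(
    predictor_type="binary_classifier",  # or "regressor" / "multiclass_classifier"
    normalize_data=True,                 # built-in normalization
    l1=0.0001,                           # L1 regularization
    wd=0.0001,                           # weight decay = L2 regularization
)
ll.fit({"train": TrainingInput("s3://my-bucket/train/", content_type="text/csv")})
```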
XGBoost
eXtreme Gradient Boosting
boosted group of decision trees
gradient descent to minimize loss
can be used for classification and regression
XGBoost Input
CSV, libsvm
more recently, recordIO/protobuf and Parquet are also supported
XGBoost Usage
Models are serialized/deserialized with Pickle
can be used within a notebook or as a built-in SageMaker algorithm
HPs: Subsample, eta, gamma, alpha, lambda
memory-bound rather than compute-bound on CPUs; newer releases (1.2+) also support single-instance GPU training (tree_method=gpu_hist)
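A sketch of setting those HPs on the built-in XGBoost container (role/paths are placeholders; the version pin is illustrative):

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder

container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

xgb = Estimator(container, role, instance_count=1, instance_type="ml.m5.xlarge",
                sagemaker_session=session)
xgb.set_hyperparameters(
    objective="binary:logistic",
    num_round=100,
    eta=0.2,        # step-size shrinkage; lower values slow learning, curb overfitting
    gamma=4,        # min loss reduction to make a split; higher = more conservative
    alpha=0.1,      # L1 regularization on weights
    subsample=0.8,  # row subsampling; < 1.0 also helps prevent overfitting
    # "lambda" (L2 regularization) is a Python keyword; pass it as
    # xgb.set_hyperparameters(**{"lambda": 1.0}) if needed.
)
xgb.fit({
    "train": TrainingInput("s3://my-bucket/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/validation/", content_type="text/csv"),
})
```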
Seq2Seq
Input is a sequence of tokens, output is a sequence of tokens
good for machine translation, text summarization, speech to text
Seq2Seq Input
recordIO/protobuf - tokens must be integers
start with tokenized text files
NEED TO PROVIDE TRAINING DATA, VALIDATION DATA, AND VOCAB FILES
Seq2Seq Usage
Training can take days
Pretrained models available
Public training datasets available for specific translation tasks
HPs: batch size, optimizer type, # of layers
can optimize on accuracy, BLEU score, perplexity
GPU only, and only a single machine for training (but can use multiple GPUs on that machine)
DeepAR
forecasting one-dimensional time-series data
uses RNNs
allows you to train the same model on several related time series
finds frequency and seasonality
DeepAR Input
JSON Lines (optionally gzipped) or Parquet
each record must contain: start, target
can contain dynamic/categorical features
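A sketch of that record format as JSON Lines (all values made up):

```python
import json

# DeepAR JSON Lines input: one JSON object per line. "start" is the series
# start timestamp and "target" the observed values; "cat" and "dynamic_feat"
# are the optional categorical/dynamic features.
series = [
    {"start": "2024-01-01 00:00:00", "target": [5.0, 6.1, 7.3, 6.8],
     "cat": [0], "dynamic_feat": [[1.0, 1.1, 2.1, 1.9]]},
    {"start": "2024-01-01 00:00:00", "target": [2.0, 2.4, 2.1, 2.7], "cat": [1]},
]
with open("train.json", "w") as f:
    for s in series:
        f.write(json.dumps(s) + "\n")
```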
DeepAR Usage
- always include the entire time series
- use the entire dataset as the test set; remove the last points from each series for training and evaluate on the withheld values
- don't use very large values for prediction length
- train on many related time series when possible
HPs: epochs, batch size, learning rate, # cells, context length
GPU or CPU for training, CPU only for inference
BlazingText
- Text Classification
  - predict labels for a sentence (NOT entire documents)
  - supervised
  - ex. web search, information retrieval
- Word2Vec
  - vector representation of words
  - semantically similar words are represented by vectors close to each other ("word embedding")
  - useful for NLP, but is not an NLP algorithm itself
  - only works on INDIVIDUAL words, not sentences or documents
BlazingText Input
- Text Classification (supervised mode)
  - one sentence per line
  - first "word" in the sentence is the string __label__ followed by the label
  - augmented manifest text format is also accepted
- Word2Vec
  - just expects a text file with one sentence per line
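A sketch of the supervised-mode file format (labels and sentences made up):

```python
# BlazingText supervised mode: one sentence per line, first token is
# __label__<label>, remaining tokens space-separated.
lines = [
    "__label__4 linux ready for prime time , intel says .",
    "__label__2 bowled over by the atlanta braves season opener .",
]
with open("train.txt", "w") as f:
    f.write("\n".join(lines) + "\n")
```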
BlazingText Usage
Word2Vec has multiple modes:
- cbow > continuous bag of words (order doesn't matter)
- skip-gram (order matters)
- batch skip-gram (distributed over many CPU nodes)
HPs:
- Word2Vec: mode, learning rate, window size, vector dim, negative samples
- Text Classification: epochs, learning rate, word n-grams (how many words we look at together), vector dim
cbow and skip-gram use GPU (can also use CPU)
batch skip-gram can use single or multiple CPU instances
text classification: CPU for smaller jobs, GPU for larger ones
Object2Vec
- like Word2Vec but with arbitrary objects
- boils data down to a lower-dimensional representation
- compute nearest neighbors, visualize clusters, genre prediction, recommendations
- UNSUPERVISED
Object2Vec Input
- tokenized into integers
- pairs or sequences of tokens
- sentence-sentence, labels-sequence, customer-customer, product-product, user-item
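A sketch of the JSON Lines pair format (token IDs and labels made up):

```python
import json

# Object2Vec pairs: each line holds two tokenized-integer sequences
# ("in0", "in1") plus a relationship label or score.
pairs = [
    {"label": 1, "in0": [6, 17, 606], "in1": [16, 21, 13]},
    {"label": 0, "in0": [22, 1016], "in1": [32, 10, 2]},
]
with open("train.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```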
Object2Vec Usage
- process data into JSON and shuffle
- train with 2 input channels, 2 encoders, 1 comparator
- encoder choices:
- average pooled embeddings, CNN, bidirectional LSTM
- comparator is followed by a feed-forward neural network
HPs: usual deep learning ones:
- dropout, early stopping, epochs, learning rate, batch size, layers, activation function, optimizer, weight decay
- encoder1 network, encoder2 network
single machine, multi GPU
set the INFERENCE_PREFERRED_MODE environment variable to optimize for encoder embeddings rather than classification or regression
Object Detection
- identify all objects in an image with bounding boxes
- detect and classify with one deep neural network
- provide confidence scores
- can train images from scratch, or use pretrained models on ImageNet
Object Detection Input
- recordIO or image format (need JSON file for annotation data)
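A sketch of the per-image annotation JSON used with image format (file names, sizes, and boxes made up):

```python
import json

# One annotation file per training image: bounding boxes are given as
# left/top offsets plus width/height, with class IDs mapped to names.
annotation = {
    "file": "images/sample_image1.jpg",
    "image_size": [{"width": 500, "height": 400, "depth": 3}],
    "annotations": [
        {"class_id": 0, "left": 111, "top": 134, "width": 61, "height": 128},
    ],
    "categories": [{"class_id": 0, "name": "dog"}],
}
with open("sample_image1.json", "w") as f:
    json.dump(annotation, f)
```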
Object Detection Usage
- takes an image as input; outputs all instances of all objects in the image with categories and confidence scores
- uses a CNN with the Single Shot multibox Detector (SSD) algorithm
- base network: VGG-16 or ResNet-50
- transfer learning mode / incremental training
- use a pretrained model for the base network weights instead of random initial weights
- uses flip, rescale, and jitter to avoid overfitting
HPs: batch size, learning rate, optimizer
GPU for training, CPU for inference
Image Classification
- assign one or more labels to an image
- doesn’t tell you where the objects are (no bounding)
Image Classification Input
- MXNet RecordIO (not protobuf!)
- raw images
  - require a .lst file to associate image index, class label, and image path
- augmented manifest image format enables pipe mode!
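A sketch of the .lst format (paths and labels made up):

```python
# MXNet .lst file: tab-separated image index, class label, and relative
# image path, one image per line.
rows = [(0, 2, "images/cat_001.jpg"), (1, 0, "images/dog_042.jpg")]
with open("train.lst", "w") as f:
    for index, label, path in rows:
        f.write(f"{index}\t{label}\t{path}\n")
```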
Image Classification Usage
- ResNet CNN
- full training > initialized with random weights
- transfer learning mode:
- initialized with pretrained weights
- top layer is initialized with random weights
- network is fine-tuned with new training data
- default image size is 3-channel 224x224
HPs: batch, learning rate, optimizers (weight decay, beta1, beta2, eps, gamma)
GPU for training, GPU or CPU for inference
Semantic Segmentation
- pixel-level object classification
- useful for self-driving cars
- produces a segmentation mask
Semantic Segmentation Input
- JPG images and PNG annotations
- label maps for describing annotations
- augmented manifest image format for pipe!
- JPG for inference
Semantic Segmentation Usage
- built on MXNet Gluon and GluonCV
- choice of 3 algorithms:
- fully-convolutional network (FCN)
- pyramid scene parsing (PSP)
- DeepLabV3
- backbone: ResNet-50 or ResNet-101
  - both trained on ImageNet
HPs: epochs, learning rate, batch size, optimizer, algorithm used, backbone used
single machine GPU only, CPU or GPU for inference
Random Cut Forest
- unsupervised anomaly detection
- detects:
  - spikes in time-series data
  - breaks in periodicity
  - unclassifiable data points
- gives anomaly score to each point
- Amazon is very proud of this one!
Random Cut Forest Inputs
- CSV or recordIO/protobuf
- file or pipe
- optional test channel for computing AUC, recall, precision, F1 score
Random Cut Forest Usage
- creates a forest of trees where each tree is a partition of the training data
- looks at the expected change in complexity of the tree as a result of adding a new point
- data is sampled randomly, then trained
- can be used on time series
HPs: number of trees (increasing # reduces noise), samples / tree
no GPU
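A minimal sketch with the SDK's RandomCutForest estimator (role and data are placeholders):

```python
import numpy as np
import sagemaker
from sagemaker import RandomCutForest

session = sagemaker.Session()
rcf = RandomCutForest(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",  # CPU instance: RCF does not use GPUs
    num_trees=100,                 # more trees reduce noise in the anomaly scores
    num_samples_per_tree=256,      # each tree sees a random sample of the data
    sagemaker_session=session,
)
train_data = np.random.rand(1000, 3).astype("float32")  # stand-in training data
rcf.fit(rcf.record_set(train_data))
```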
Neural Topic Model
- organize documents into topics
- classify/summarize documents based on topics
- not just TF/IDF
- NTM groups things into higher levels
- unsupervised
- uses a neural variational inference algorithm
Neural Topic Model Input
- four data channels
- train channel required (validation, test, aux optional)
- recordIO/protobuf or CSV
- words need to be tokenized with a vocab file
- file or pipe mode
Neural Topic Model Usage
- define how many topics to generate
- latent representation based on top-ranking words
- one of two topic modeling algorithms in SageMaker (the other is LDA)
HPs: smaller batch size and learning rate can reduce validation loss but increase training time, # of topics
CPU or GPU
Latent Dirichlet Allocation (LDA)
- topic modeling (not deep learning)
- unsupervised
- grouping of documents with shared subset of words
- can be used for things other than words
- customer clusters, harmonic analysis
LDA Input
- train, optional test channel
- recordIO/protobuf or CSV - need to tokenize
- each document has counts for every word in vocab (CSV)
- pipe mode only supported with recordIO
LDA Usage
- unsupervised > you pick the # of topics
- test channel - score results
- functionally similar to Neural Topic Modeling, but CPU based
HPs: # of topics, alpha0 (initial guess for the concentration parameter)
single instance CPU
KNN
- supervised
- simple classification or regression algorithm
- classification:
- find K closest points to a sample and return most frequent label
- regression:
- find K closest points to a sample and return average value
KNN Input
- train, optional test channel
- recordIO/protobuf or CSV
- file or pipe
KNN Usage
- data is sampled
- dimensionality reduction (optional)
  - avoids sparse data ("curse of dimensionality") at the cost of noise/accuracy
  - "sign" or "fjlt" methods
- build an index for looking up neighbors
- serialize the model
- query the model for a given K
HPs: K, sample size
CPU or GPU
inference - CPU for lower latency, GPU for higher throughput
K-Means
- unsupervised clustering
- divide data into K groups where members are similar
- you define “similar” > Euclidean distance
- web-scale k-means clustering
K-means Inputs
- train channel (use the ShardedByS3Key distribution), optional test channel (FullyReplicated)
- recordIO/protobuf or CSV
- file and pipe
K-Means Usage
- every observation mapped to n-dimensional space
- works to optimize center of K-clusters
- extra cluster centers may be specified to improve accuracy
- K = k*x
- k = clusters we want
- x = extra cluster centers
- algorithm:
  - determine initial cluster centers: random or k-means++
    - k-means++ tries to make initial clusters far apart
  - iterate over the data and calculate cluster centers
  - reduce clusters from K to k (using Lloyd's method with k-means++)
HPs: batch size, extra center factor (x), init method (random or k-means++), K
- K is tricky: use elbow method - basically optimize for tightness of clusters
CPU or GPU (CPU recommended)
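The elbow method itself is easy to sketch locally; scikit-learn is used here purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # stand-in data

# Inertia = within-cluster sum of squares ("tightness of clusters"); pick
# the k where the improvement flattens out (the "elbow").
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)
```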
Principal Component Analysis (PCA)
- dimensionality reduction
- projecting higher-dimensional data into a lower-dimensional space (like a 2D plot) while minimizing loss of information
- reduced dimensions are called components
- first component has largest possible variability
- 2nd component has next largest
- unsupervised
PCA Inputs
- recordIO/protobuf
- file or pipe
PCA Usage
- covariance matrix created, then singular value decomposition (SVD)
- 2 modes:
- regular - sparse data, moderate # of features
- randomized - large # of features
- uses approximation algorithm
HPs: algorithm mode, subtract mean (unbias data)
CPU or GPU - depends on specifics of data
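A numpy sketch of the covariance + SVD steps above (data made up):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))
Xc = X - X.mean(axis=0)            # "subtract mean" (unbias the data)
cov = np.cov(Xc, rowvar=False)     # covariance matrix
U, S, Vt = np.linalg.svd(cov)      # singular value decomposition
components = Vt[:2]                # first component captures the most variability
reduced = Xc @ components.T        # project into 2 dimensions
```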
Factorization Machines
- classification/regression with SPARSE DATA
- good for recommendations
- click prediction
- item recommendations
- since a user doesn’t interact with most pages/products, the data is sparse
- supervised (classification or regression)
- limited to pair-wise interactions
- user - item
Factorization Machines Inputs
- recordIO/protobuf with Float32
- sparse data means CSV isn’t practical
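A sketch of producing sparse float32 recordIO/protobuf with the SDK's helper (shapes and values made up):

```python
import io
import numpy as np
import scipy.sparse
import sagemaker.amazon.common as smac

# Write sparse float32 features as recordIO/protobuf: the practical format
# when one-hot user/item data is almost entirely zeros.
X = scipy.sparse.lil_matrix((1000, 50000), dtype="float32")
X[0, 42] = 1.0                       # e.g., user 0 interacted with item 42
y = np.zeros(1000, dtype="float32")  # labels: clicked or not

buf = io.BytesIO()
smac.write_spmatrix_to_sparse_tensor(buf, X.tocsr(), y)
buf.seek(0)
# upload buf to S3 as the training channel
```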
Factorization Machines Usage
- essentially makes a big matrix
- find factors we can use to predict a classification (click or not) or value (predicted rating) given a matrix representing some pair of things (users and items)
HPs: initialization methods for bias, factors, and linear terms
- uniform, normal, or constant
CPU or GPU - CPU recommended, GPU only works with dense data
IP Insights
- unsupervised
- learning of IP address usage patterns
- ID suspicious activity
- security tool
IP Insights Inputs
- user names, account IDs can be fed in directly
- training channel, optional validation (computes AUC)
- CSV only
- each row: entity, IP address
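A sketch of that headerless CSV (entities and IPs made up):

```python
# IP Insights training input: headerless CSV, one "entity,IP address"
# pair per line.
rows = [("user_alice", "192.0.2.10"), ("user_bob", "198.51.100.7")]
with open("train.csv", "w") as f:
    for entity, ip in rows:
        f.write(f"{entity},{ip}\n")
```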
IP Insights Usage
- uses a neural network to learn latent vector representations of entities and IP addresses
- entities hashed and embedded
- need big enough hash size
- auto generates negative samples by randomly pairing entities and IPs
HPs: # of entity vectors (hash size), vector dim, epochs, learning rate, batch size
CPU or GPU (GPU recommended)