SageMaker Built-in Algorithms Flashcards
Linear Learner
linear regression
can handle both regression and classification
for classification, a linear threshold is used
Linear Learner Input Format
recordIO/protobuf, CSV
file or pipe mode supported
Linear Learner Usage
preprocessing: data must be normalized and shuffled
training: choose an optimization algorithm; multiple models are optimized in parallel; tune L1 and L2 regularization
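The preprocessing step above can be sketched in plain Python. This is a toy stand-alone sketch with made-up data; in practice Linear Learner can also normalize for you via its hyperparameters.

```python
import random

def normalize(rows):
    """Scale each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

# hypothetical toy feature matrix (two features on very different scales)
data = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
scaled = normalize(data)
random.shuffle(scaled)  # shuffle rows before training
```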
XGBoost
eXtreme Gradient Boosting
boosted group of decision trees
gradient descent to minimize loss
can be used for classification and regression
XGBoost Input
CSV, libsvm
recently also recordIO/protobuf and Parquet
XGBoost Usage
Models are serialized/deserialized with Pickle
can be used within a notebook or as a built-in SM algorithm
HPs: subsample, eta, gamma, alpha, lambda
historically CPU-only and memory-bound (not compute-bound); newer SageMaker XGBoost versions can also train on a single GPU (tree_method=gpu_hist)
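The hyperparameters and pickle serialization from the card can be sketched together. The values below are illustrative, and a plain dict stands in for a trained Booster object so the sketch runs without xgboost installed:

```python
import pickle

# key XGBoost hyperparameters from the card (values are illustrative)
hyperparams = {
    "subsample": 0.8,  # row sampling per tree; < 1.0 helps prevent overfitting
    "eta": 0.1,        # learning rate / step-size shrinkage
    "gamma": 1.0,      # minimum loss reduction required to make a further split
    "alpha": 0.0,      # L1 regularization on weights
    "lambda": 1.0,     # L2 regularization on weights
}

# SageMaker XGBoost models are serialized/deserialized with pickle;
# a trained Booster object would take the place of this stand-in dict
blob = pickle.dumps(hyperparams)
restored = pickle.loads(blob)
```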
Seq2Seq
Input is a sequence of tokens, output is a sequence of tokens
good for machine translation, text summarization, speech to text
Seq2Seq Input
recordIO/protobuf - tokens must be integers
start with tokenized text files
NEED TO PROVIDE TRAINING DATA, VALIDATION DATA, AND VOCAB FILES
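"Tokens must be integers" means each word is mapped through a vocabulary file to an integer id. A minimal sketch, with a made-up vocabulary and illustrative special tokens:

```python
# hypothetical vocabulary mapping tokens to integer ids
vocab = {"<s>": 0, "</s>": 1, "<unk>": 2, "hello": 3, "world": 4}

def encode(sentence, vocab):
    """Map a tokenized sentence to the integer ids Seq2Seq consumes."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in sentence.split()]

ids = encode("hello world again", vocab)  # "again" falls back to <unk>
```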
Seq2Seq Usage
Training can take days
Pretrained models available
Public training datasets available for specific translation tasks
HPs: batch size, optimizer type, # layers
can optimize on accuracy, BLEU score, perplexity
training uses a single machine only (can use multiple GPUs on that machine)
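Of the optimization metrics above, perplexity is simply the exponential of the average negative log-likelihood per token. A small worked sketch with hypothetical per-token log-probabilities:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# hypothetical model that assigns each token probability 0.25:
# uniform 1-in-4 guessing gives a perplexity of exactly 4
lp = [math.log(0.25)] * 4
ppl = perplexity(lp)
```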
DeepAR
forecasting 1-D time-series data
uses RNNs
allows you to train the same model on several related time series
finds frequency and seasonality
DeepAR Input
JSON Lines format (optionally gzip-compressed, or Parquet)
each record must contain: start, target
can contain dynamic/categorical features
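One record of the JSON Lines training data can be sketched like this; `start` and `target` are the required fields, and the feature values are illustrative:

```python
import json

# one JSON Lines record per time series; "start" and "target" are required,
# "cat" and "dynamic_feat" are optional (values here are illustrative)
record = {
    "start": "2024-01-01 00:00:00",
    "target": [5.0, 7.0, 6.0, 9.0],          # the 1-D series itself
    "cat": [0],                               # categorical feature(s)
    "dynamic_feat": [[1.0, 1.0, 0.0, 1.0]],   # e.g. a promotion indicator
}
line = json.dumps(record)  # each record becomes one line of the training file
```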
DeepAR Usage
- always include the entire time series for training, testing, and inference
- use the entire dataset as a test set; remove the last time points for training, then evaluate on the withheld values
- don't use very large values for prediction length
- train on many related time series when possible
HPs: epochs, batch size, learning rate, # cells, context length
GPU or CPU for training, CPU only for inference
BlazingText
- Text Classification
  - predicts labels for a sentence (not entire documents)
  - supervised
  - ex. web search, information retrieval
- Word2Vec
  - creates vector representations of words (word embeddings)
  - semantically similar words are represented by vectors close to each other
  - useful as input to NLP algorithms, but not an NLP algorithm itself
  - only works on INDIVIDUAL words, not sentences or documents
BlazingText Input
- Text Classification (supervised mode)
  - one sentence per line
  - first word in each sentence is the label, prefixed with __label__
  - also accepts the augmented manifest text format
- Word2Vec
  - text file with one sentence per line
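The supervised (text classification) line format can be sketched as a tiny formatter; the label value and sentence are made up:

```python
# supervised (text classification) format: the label leads each line,
# prefixed with __label__, followed by the tokenized sentence
def to_blazingtext_line(label, sentence):
    return f"__label__{label} {sentence.lower()}"

# hypothetical example: class 4, one tokenized sentence per line
line = to_blazingtext_line(4, "linnaeus borg is a famous cat .")
```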
BlazingText Usage
Word2Vec has multiple modes:
- cbow > continuous bag of words (order doesn't matter)
- skip-gram (order matters)
- batch skip-gram (distributed over CPU nodes)
HPs:
- Word2Vec: mode, learning rate, window size, vector dim, negative samples
- Text Classification: epochs, learning rate, word n-grams (how many words we look at together), vector dim
cbow and skip-gram use a single GPU (can also use CPU)
batch skip-gram can use single or multiple CPU instances
Text Classification: use CPU for smaller datasets, GPU for larger ones
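The cbow vs. skip-gram distinction above comes down to which training pairs are generated from each sentence. A minimal sketch (not the real BlazingText implementation) of the two pairings:

```python
def training_pairs(tokens, window=1, mode="skip-gram"):
    """Sketch of the (context, target) pairs the two Word2Vec modes train on."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            # cbow: the whole context bag predicts the target word
            pairs.append((context, target))
        else:
            # skip-gram: the target word predicts each neighbor in turn
            pairs += [(target, c) for c in context]
    return pairs

toks = ["the", "cat", "sat"]
cbow_pairs = training_pairs(toks, mode="cbow")
sg_pairs = training_pairs(toks)
```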
Object2Vec
- like Word2Vec but with arbitrary objects
- boils data down to a lower-dimensional embedding
- compute nearest neighbors, visualize clusters, genre prediction, recommendations
- UNSUPERVISED
Object2Vec Input
- tokenized into integers
- pairs or sequences of tokens
- sentence-sentence, labels-sequence, customer-customer, product-product, user-item
Object2Vec Usage
- process data into JSON and shuffle
- train with 2 input channels, 2 encoders, 1 comparator
- encoder choices:
- average pooled embeddings, CNN, bidirectional LSTM
- comparator is followed by a feed-forward neural network
HPs: usual deep learning ones:
- dropout, early stopping, epochs, learning rate, batch size, layers, activation function, optimizer, weight decay
- encoder1 network, encoder2 network
single machine, multi GPU
use the inference preferred mode setting to optimize for encoder embeddings rather than classification or regression output
Object Detection
- identifies all objects in an image with bounding boxes
- detect and classify with one deep neural network
- provide confidence scores
- can train from scratch, or use pretrained models based on ImageNet
Object Detection Input
- recordIO or image format; with image format, a JSON file is needed per image for annotation data
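For image format, the per-image JSON annotation looks roughly like this; the field names follow the SageMaker object detection annotation schema, and the file name and box values are illustrative:

```python
# hypothetical annotation for one image (values are illustrative)
annotation = {
    "file": "dog_and_bike.jpg",
    "image_size": [{"width": 500, "height": 400, "depth": 3}],
    "annotations": [
        # one bounding box per detected object, in pixel coordinates
        {"class_id": 0, "left": 111, "top": 134, "width": 61, "height": 128}
    ],
    "categories": [{"class_id": 0, "name": "dog"}],
}
```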