Models Flashcards
Linear Learner - Instance types
Single or multi-machine CPU or GPU
Multi-GPU does not help
Linear Learner - Hyperparams
Balance_multiclass_weights → gives each class equal importance in the loss function
Learning rate
Mini_batch_size
L1 regularisation
Wd = weight decay = L2 regularisation
Linear Learner - Model types
Can handle both regression (numeric) predictions and classification problems
For classification, a linear threshold function is used
Can do binary or multi-class problems
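A minimal sketch of the linear threshold idea for binary classification: compute a weighted sum of the features and compare it with a threshold. The weights, bias, and threshold are made-up values for illustration, not anything Linear Learner outputs.

```python
def predict_binary(weights, bias, features, threshold=0.0):
    """Return 1 if the linear score w.x + b exceeds the threshold, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > threshold else 0


# Hypothetical weights for a two-feature model (illustration only).
weights = [0.8, -0.5]
bias = -0.1

label_a = predict_binary(weights, bias, [1.0, 0.2])  # score = 0.8 - 0.1 - 0.1 = 0.6
label_b = predict_binary(weights, bias, [0.1, 1.0])  # score = 0.08 - 0.5 - 0.1 = -0.52
```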
Linear Learner - Input format
RecordIO-wrapped protobuf → float32 data
CSV → first column is the label
File or Pipe mode both supported
Linear Learner - Pre-processing
Training data should be normalised (so all features are weighted the same)
Linear learner can do this for you
Input data should be shuffled
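A rough sketch of doing the pre-processing yourself: standardise each feature column, put the label in the first CSV column, and shuffle the rows. The data and the fixed seed are made up; Linear Learner's own normalisation may differ in detail.

```python
import random
import statistics


def standardise_columns(rows):
    """Scale each feature column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Toy data: two features on very different scales.
labels = [0, 1, 0, 1]
features = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]]

scaled = standardise_columns(features)

# CSV layout: label first, then the features.
records = [[label] + row for label, row in zip(labels, scaled)]
random.Random(42).shuffle(records)  # fixed seed only to make the sketch repeatable
csv_lines = [",".join(str(v) for v in record) for record in records]
```

After scaling, both columns carry the same values, so neither dominates just because of its units.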
Linear Learner - Training
Uses SGD
Choose an optimisation algo (Adam, Adagrad, SGD, etc)
Multiple models are optimised in parallel, and the best one is chosen in the validation step
Tune L1, L2 regularisation
Linear Learner - Validation
The best-performing model is selected
XGBoost - Model Type
eXtreme Gradient Boosting
Boosted group of decision trees
New trees made to correct the errors of previous trees
Uses gradient descent to minimise loss as new trees are added
Can be used for:
Classification
Regression (uses regression trees)
Can use it:
Within notebook as sagemaker.xgboost
Or use sagemaker container
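A toy sketch of the boosting idea from this card: each new tree is fitted to the errors (residuals) of the trees before it. This is plain gradient boosting with one-split stumps on squared error, not XGBoost's actual implementation; the data, n_trees, and eta values are made up.

```python
def fit_stump(xs, residuals):
    """Find the one-split regression stump that best fits the residuals."""
    best = None
    for split in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x <= split else right_mean


def boost(xs, ys, n_trees=20, eta=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    trees, losses = [], []
    for _ in range(n_trees):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # eta shrinks each tree's contribution (step size shrinkage).
        predictions = [p + eta * tree(x) for p, x in zip(predictions, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return base, trees, losses


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
base, trees, losses = boost(xs, ys)
```

The training loss in `losses` shrinks as trees are added, which is the "minimise loss as new trees are added" point.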
XGBoost - Input
CSV
Libsvm
recordIO-protobuf
Parquet format
XGBoost - Hyperparameters
Subsample → prevents overfitting
Eta → step size shrinkage → prevents overfitting
Gamma → minimum loss reduction to create a partition, larger = more conservative
Alpha = L1 regularisation term; larger = more conservative model
Lambda = L2 regularisation term; larger = more conservative model
Eval_metric = metric to optimise on (AUC, error, rmse, …) → e.g. error if you’re optimising on accuracy; if you’re focusing on false positives, you might set this to AUC
Scale_pos_weight:
-Adjusts balance of positive and negative weights
-Helps for unbalanced classes
-Might set to sum(negative cases)/sum(positive cases)
Max_depth = max depth of tree → too high and you might overfit
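A small sketch of the scale_pos_weight rule of thumb, plus how the hyperparameters on this card might sit together in one dict. The labels and all numeric values are made up, and the objective entry is an assumption added for completeness.

```python
# Toy labels for an unbalanced binary problem: 8 negatives, 2 positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

positives = sum(1 for y in labels if y == 1)
negatives = sum(1 for y in labels if y == 0)

# Rule of thumb from the card: sum(negative cases) / sum(positive cases).
scale_pos_weight = negatives / positives

# Illustrative values only; tune them for a real problem.
hyperparameters = {
    "objective": "binary:logistic",  # assumption, not from the card
    "eval_metric": "auc",
    "scale_pos_weight": scale_pos_weight,
    "subsample": 0.8,
    "eta": 0.2,
    "gamma": 1,
    "alpha": 0.1,
    "lambda": 1.0,
    "max_depth": 5,
}
```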
XGBoost - instances
Uses CPUs for multiple instance training
Memory-bound → not compute bound
–> So M5 is a good choice for multiple instance
If using 1 instance
As of XGBoost 1.2, single instance GPU training is available
E.g. P2 or P3 instance types
–> Must set tree_method hyperparameter to gpu_hist
–> Trains more quickly → can be more cost effective
In XGBoost 1.2-2 or later, supported GPU instance families:
P2, P3, G4dn, G5
Seq2Seq - Model type
Input is a sequence of tokens, output is a sequence of tokens
Uses:
Machine translation
Text summarisation
Speech to text
Implemented with RNNs and CNNs with attention
Seq2Seq - Inputs
recordIO-protobuf → tokens must be integers (this is unusual, since most algorithms want floating point data)
Start with tokenised text files
Convert to protobuf using sample code
- Packs into integer tensor with vocab files
- A lot like TF/IDF
Must provide training data, validation data, and vocabulary files
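A rough sketch of the "tokens must be integers" step: build a vocabulary from tokenised text and map each token to its integer id. This is not the SageMaker sample code; the special tokens and the frequency-then-alphabetical ordering are assumptions for illustration.

```python
from collections import Counter


def build_vocab(sentences, specials=("<pad>", "<unk>", "<s>", "</s>")):
    """Map each token to an integer id, most frequent tokens first."""
    counts = Counter(token for sentence in sentences for token in sentence.split())
    vocab = {token: idx for idx, token in enumerate(specials)}
    # Ties are broken alphabetically so the ids are deterministic.
    for token, _ in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        vocab[token] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Turn a whitespace-tokenised sentence into a list of integer ids."""
    unk = vocab["<unk>"]
    return [vocab.get(token, unk) for token in sentence.split()]


source_sentences = ["the cat sat", "the dog sat down"]
vocab = build_vocab(source_sentences)
encoded = encode("the cat ran", vocab)  # "ran" is unseen, so it maps to <unk>
```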
Seq2Seq - training
Can take days to train
Pre-trained models are available → see example notebook
Public training datasets are available for specific translation tasks
Seq2Seq - Hyperparameters
Batch_size
Optimizer_type (adam, sgd, rmsprop)
Learning_rate
Num_layers_encoder, num_layers_decoder
Can optimise on:
- Accuracy
– Vs provided validation dataset
- BLEU score
– Compares against multiple reference translations
– Good for machine translation
- Perplexity
– Cross entropy
– Good for machine translation
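A quick sketch of how perplexity relates to cross entropy: perplexity is the exponential of the average cross entropy over the reference tokens, so lower is better. The token probabilities are made up.

```python
import math


def cross_entropy(token_probabilities):
    """Average negative log-likelihood the model gave to the reference tokens."""
    return -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)


def perplexity(token_probabilities):
    """Perplexity is exp(cross entropy)."""
    return math.exp(cross_entropy(token_probabilities))


# Probabilities a hypothetical model assigned to each correct token.
confident = perplexity([0.9, 0.8, 0.95])
uncertain = perplexity([0.25, 0.25, 0.25])  # like guessing among 4 equally likely tokens
```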
Seq2Seq - Instances
Only GPU e.g. P3
Can only use a single machine for training → can use multiple GPUs on that machine, but training can’t be parallelised across multiple machines
DeepAR - Model Type
Forecasting one-dimensional time-series data
- Allows you to train the same model over several related time series
- Finds frequencies and seasonality
Uses RNNs
DeepAR - Input
JSON lines format → can be GZIP-compressed, or use Parquet, for better performance
Each record must contain:
start: the starting time stamp
target: the time series values
Each record can contain:
Dynamic features (dynamic_feat) e.g. whether a promotion was applied to the product, in a time series of product purchases
Categorical features (cat)
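A small sketch of what the records on this card could look like as gzipped JSON lines, one time series per line. The values are made up; the sketch writes to an in-memory buffer rather than a real file.

```python
import gzip
import io
import json

# Two toy series; dynamic_feat has one value per time point in target.
series = [
    {
        "start": "2024-01-01 00:00:00",
        "target": [12.0, 15.0, 9.0, 20.0],
        "cat": [0],
        "dynamic_feat": [[0, 0, 1, 1]],  # e.g. promotion on/off per time point
    },
    {
        "start": "2024-01-01 00:00:00",
        "target": [3.0, 4.0, 4.0, 6.0],
        "cat": [1],
        "dynamic_feat": [[1, 0, 0, 0]],
    },
]

# Write one JSON object per line into a gzip stream.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
    for record in series:
        gz.write((json.dumps(record) + "\n").encode("utf-8"))

# Read it back to confirm the round trip.
lines = gzip.decompress(buffer.getvalue()).decode("utf-8").splitlines()
decoded = [json.loads(line) for line in lines]
```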
DeepAR - How is it used?
Always include the entire series for training, testing, and inference:
Use entire dataset as a test set, remove last time points for training → evaluate on the withheld values
Don’t use very large values for prediction length (>400) → can’t forecast too far into the future
Train on many time series and not just one when possible
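A sketch of the split described on this card: the test set keeps every full series, and the training set drops the last prediction_length points of each. The helper name and the data are hypothetical.

```python
def split_for_deepar(series, prediction_length):
    """Return (train, test): test keeps full series, train drops the last points."""
    if prediction_length <= 0:
        raise ValueError("prediction_length must be positive")
    test = [dict(record) for record in series]
    train = []
    for record in series:
        if len(record["target"]) <= prediction_length:
            raise ValueError("series is shorter than prediction_length")
        # Only target is trimmed here; per-time-point features would need the same cut.
        train.append({**record, "target": record["target"][:-prediction_length]})
    return train, test


series = [
    {"start": "2024-01-01 00:00:00", "target": [5, 6, 7, 8, 9, 10]},
    {"start": "2024-01-01 00:00:00", "target": [1, 1, 2, 3, 5, 8]},
]
train, test = split_for_deepar(series, prediction_length=2)

# The withheld tail of each test series is what forecasts get evaluated on.
withheld = [record["target"][-2:] for record in test]
```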
DeepAR - Hyperparameters
Context_length = number of time points the model sees before making a prediction
Can be smaller than seasonalities → the model will lag one year anyhow
Epochs
Mini_batch_size
Learning_rate
Num_cells = number of neurons
DeepAR - Instances
CPU or GPU
Single or multi machine
Recommendation: start with CPU (ml.c4.2xlarge then ml.c4.4xlarge)
Move up to GPU if necessary
with large mini-batch-size or with larger models
May need larger instances for tuning
BlazingText - Model Type
Only for sentences → not entire documents
Text classification:
Predicts labels for a sentence
Useful in web searches, information retrieval
Supervised
Word2Vec:
Creates a vector representation of words
Semantically similar words are represented by vectors close to each other
This is called word embedding
It is useful in NLP, but is not an NLP algo itself
Used in machine translation, sentiment analysis
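A toy sketch of "semantically similar words are close to each other", using cosine similarity. The three-dimensional vectors are invented for illustration; real embeddings have far more dimensions and come from training.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, embeddings):
    """Return the other word whose vector is most similar to the given word's."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))


# Made-up embeddings (illustration only).
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "banana": [0.1, 0.05, 0.9],
}
closest_to_king = nearest("king", embeddings)
```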
BlazingText - Input
For supervised mode (text classification)
One sentence per line
First “word” in the sentence is the string __label__ followed by the label e.g. “__label__4 hello there this is a sentence”
Also “augmented manifest text format” → JSON string
With a source field and a label field
Word2Vec just wants a text file with one training sentence per line
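A minimal sketch of building supervised-mode input lines with the __label__ prefix. The helper name is hypothetical, and lowercasing plus whitespace tokenising is an assumed pre-processing step rather than a requirement stated on this card.

```python
def to_blazingtext_line(label, sentence):
    """Format one training example as '__label__<label> <space-separated tokens>'."""
    tokens = sentence.lower().split()  # assumed pre-processing: lowercase, split on whitespace
    return "__label__" + str(label) + " " + " ".join(tokens)


examples = [(4, "Hello there this is a sentence"), (1, "Another   short one")]
training_lines = [to_blazingtext_line(label, text) for label, text in examples]
training_text = "\n".join(training_lines)  # one sentence per line
```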
BlazingText - modes of Word2Vec?
Word2vec has multiple modes:
CBow (continuous bag of words → order of words is thrown out, just the words themselves matter)
Skip-gram
Batch skip-gram → distributed computation over many CPU nodes
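A sketch of how the training examples differ between the modes: skip-gram predicts each context word from the centre word, while CBOW predicts the centre word from an unordered bag of context words. This only shows how examples are formed, not how BlazingText trains; the function names are hypothetical.

```python
def skipgram_pairs(tokens, window_size):
    """(centre, context) pairs: the centre word predicts each nearby word."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


def cbow_examples(tokens, window_size):
    """(context bag, centre) examples: the surrounding words predict the centre."""
    examples = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        # Sorting makes the point that word order in the context is thrown away.
        examples.append((sorted(context), centre))
    return examples


tokens = ["the", "quick", "brown", "fox"]
sg = skipgram_pairs(tokens, window_size=1)
cbow = cbow_examples(tokens, window_size=1)
```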
BlazingText - Hyperparameters
Word2vec:
Mode (batch_skipgram, skipgram, cbow)
Learning_rate
Window_size
Vector_dim
Negative_samples
Text classification:
Epochs
Learning_rate
Word_ngrams
Vector_dim
BlazingText - Instance types
Word2Vec:
For cbow and skipgram, recommend a single ml.p3.2xlarge → any single CPU or GPU instance will work
For batch skipgram, can use single or multiple CPU instances
For text classification:
C5 is recommended for less than 2GB of training data. For larger data sets, use a single GPU instance (ml.p2.xlarge or ml.p3.2xlarge)
Object2Vec - Model type
Word2vec → finds relationships between words in a sentence
Object2Vec → can work on entire document, or other objects
Creates low-dimensional dense embeddings of high-dimensional objects
- Compute nearest neighbour of objects
- Visualise clusters
Use cases:
Genre predictions
Recommendation (similar items or users)
Object2Vec - Input
Data must be tokenised into integers
Training data consists of pairs of tokens and/or sequences of tokens:
Sentence-sentence
Labels-sentence (genre to description?)
Customer-customer
Product-product
User-user
Object2Vec - How is it used?
Process data into JSON lines and shuffle it
Train with two input channels, two encoders, and a comparator
Encoder choices:
Average-pooled embeddings
CNN
Bidirectional LSTM
Comparator followed by a feed forward neural network
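A sketch of the first step on this card: integer-tokenised pairs written as shuffled JSON lines. The field names in0, in1, and label are an assumption based on the documented Object2Vec input layout, so check them against the current docs; the token ids and seed are made up.

```python
import json
import random

# (first sequence, second sequence, label) with tokens already mapped to integers.
pairs = [
    ([6, 17, 606], [16, 21, 13, 45], 1),
    ([22, 1016], [32, 93, 117], 0),
    ([774, 14, 21], [21, 366, 14], 1),
]

# One JSON object per pair; field names assumed, see the note above.
records = [{"label": label, "in0": in0, "in1": in1} for in0, in1, label in pairs]

random.Random(0).shuffle(records)  # fixed seed only to make the sketch repeatable
jsonl = "\n".join(json.dumps(record) for record in records)
```

Each of in0 and in1 feeds one of the two input channels, and therefore one of the two encoders.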
Object2Vec - Hyperparameters
Usual deep learning:
Dropout
Early stopping
Epochs
Learning rate
Batch size
Layers
Activation function
Optimiser
Weight decay
Enc1_network and enc2_network:
Choose cnn, bilstm, pooled_embedding → choose encoder type for each channel
Object2Vec - Instances
Can train on only a single instance (CPU, GPU, or multi-GPU):
Start with CPU: ml.m5.2xlarge
Or start with GPU: ml.p2.xlarge
If needed go up to ml.m5.4xlarge, ml.m5.12xlarge
GPU: P2, P3, G4dn, G5
Inference: ml.p3.2xlarge
Use INFERENCE_PREFERRED_MODE env var to optimise for encoder embeddings rather than classification or regression
Object Detection - Model Type
Identify all objects in an image with bounding boxes
Detects and classifies objects with a single deep neural network
Classes accompanied by confidence scores
Can train from scratch or use pre-trained models based on MXNet
Object Detection - How is it used and types?
Two variants: MXNet and TensorFlow
Takes an image as input, outputs all instances of objects in the image, with categories and confidence scores
MXNet:
Uses a CNN with single shot multibox detector (SSD) algo
Transfer learning model / incremental learning
Uses flip, rescale, and jitter internally to avoid overfitting
TensorFlow:
Uses ResNet, EfficientNet, MobileNet models
Object Detection - Input
MXNet:
- recordIO or image format (jpg or png)
With image format, supply a json with annotation data for each image
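A sketch of what the per-image annotation JSON might look like. The field names (file, image_size, annotations, categories) are an assumption based on the documented layout for the MXNet variant, so verify them against the current docs; the helper name, image name, and box values are made up.

```python
import json


def make_annotation(file_name, width, height, boxes, class_names):
    """Build one annotation record, checking that every box fits inside the image."""
    for box in boxes:
        if box["left"] + box["width"] > width or box["top"] + box["height"] > height:
            raise ValueError("bounding box falls outside the image")
    return {
        "file": file_name,
        "image_size": [{"width": width, "height": height, "depth": 3}],
        "annotations": boxes,
        "categories": [
            {"class_id": class_id, "name": name}
            for class_id, name in enumerate(class_names)
        ],
    }


annotation = make_annotation(
    "sample_image1.jpg",
    width=500,
    height=400,
    boxes=[{"class_id": 0, "left": 111, "top": 134, "width": 61, "height": 128}],
    class_names=["dog"],
)
annotation_json = json.dumps(annotation, indent=2)
```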