Models Flashcards

1
Q

Linear Learner - Instance types

A

Single or multi-machine CPU or GPU
Multi-GPU does not help

2
Q

Linear Learner - Hyperparams

A

Balance_multiclass_weights → gives each class equal importance in the loss function
Learning rate
Mini_batch_size
L1 regularisation
Wd = weight decay = L2 regularisation

3
Q

Linear Learner - Model types

A

Can handle both regression (numeric) predictions and classification problems
For classification, a linear threshold function is used
Can do binary or multi-class problems

4
Q

Linear Learner - Input format

A

RecordIO-wrapped protobuf → float32 data
CSV → first column is the label
File or Pipe mode both supported

5
Q

Linear Learner - Pre-processing

A

Training data should be normalised (so all features are weighted the same)
Linear learner can do this for you
Input data should be shuffled
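
A minimal sketch of the two pre-processing steps above, assuming the features are held as a list of numeric rows. The helper names `normalise_columns` and `shuffle_rows` are hypothetical and are not part of Linear Learner, which can do the normalisation for you.

```python
import random
import statistics

def normalise_columns(rows):
    """Z-score each feature column: subtract the mean, divide by the std dev."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population std dev; fall back to 1.0 so constant columns do not divide by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]

def shuffle_rows(rows, seed=0):
    """Return a shuffled copy so the original row order does not bias training."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled

features = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
normalised = normalise_columns(features)
shuffled = shuffle_rows(normalised)
```

After normalisation both columns sit on the same scale, so neither feature dominates just because its raw values are larger.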

6
Q

Linear Learner - Training

A

Uses stochastic gradient descent (SGD)
Choose an optimisation algo (Adam, Adagrad, SGD, etc)
Multiple models are optimised in parallel, and the best one is chosen in the validation step
Tune L1, L2 regularisation

7
Q

Linear Learner - Validation

A

The best-performing model is selected

8
Q

XGBoost - Model Type

A

eXtreme Gradient Boosting
Boosted group of decision trees
New trees made to correct the errors of previous trees
Uses gradient descent to minimise loss as new trees are added
Can be used for:
Classification
Regression (uses regression trees)

Can use it:
Within notebook as sagemaker.xgboost
Or use sagemaker container

9
Q

XGBoost - Input

A

CSV
Libsvm
recordIO-protobuf
Parquet format

10
Q

XGBoost - Hyperparameters

A

Subsample → prevents overfitting
Eta → step size shrinkage → prevents overfitting
Gamma → minimum loss reduction to create a partition, larger = more conservative
Alpha = L1 regularisation term; larger = more conservative model
Lambda = L2 regularisation term; larger = more conservative model
Eval_metric = metric to optimise on (AUC, error, rmse, etc) → e.g. use error if you care about accuracy; if you care more about false positives, you might set this to AUC
Scale_pos_weight:
-Adjusts balance of positive and negative weights
-Helps for unbalanced classes
-Might set to sum(negative cases)/sum(positive cases)
Max_depth = max depth of tree → too high and you might overfit
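
A small sketch of the scale_pos_weight rule of thumb above, assuming binary labels where 1 is the positive class. The helper name `scale_pos_weight` and the specific values for max_depth and eta are illustrative assumptions, not recommended settings.

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive cases for a binary label list (1 = positive)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("need at least one positive case")
    return negatives / positives

# 95 negative cases and 5 positive cases: a heavily unbalanced data set.
labels = [0] * 95 + [1] * 5

hyperparameters = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": scale_pos_weight(labels),  # 95 / 5
    "max_depth": 5,  # illustrative value only
    "eta": 0.2,      # illustrative value only
}
```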

11
Q

XGBoost - instances

A

Uses CPUs for multiple-instance training
Memory-bound → not compute-bound
–> So M5 is a good choice for multiple instances

If using a single instance:
As of XGBoost 1.2, single-instance GPU training is available
E.g. P2 or P3 instance types
–> Must set tree_method hyperparameter to gpu_hist
–> Trains more quickly → can be more cost effective

In XGBoost 1.2-2:
P2, P3, G4dn, G5 supported

12
Q

Seq2Seq - Model type

A

Input is a sequence of tokens, output is a sequence of tokens

Uses:
Machine translation
Text summarisation
Speech to text

Implemented with RNNs and CNNs with attention

13
Q

Seq2Seq - Inputs

A

recordIO-protobuf → tokens must be integers (this is unusual, since most algorithms want floating point data)

Start with tokenised text files

Convert to protobuf using sample code
- Packs into integer tensor with vocab files
- A lot like TF-IDF
Must provide training data, validation data, and vocabulary files
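
A rough sketch of the "tokens must be integers" step, assuming whitespace-tokenised text. `build_vocab` and `encode` are hypothetical helpers for illustration; the real sample code also writes protobuf and may reserve ids for special tokens, which this sketch does not do.

```python
def build_vocab(sentences):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Replace every token with its integer id from the vocabulary."""
    return [vocab[token] for token in sentence.split()]

corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)
encoded = [encode(sentence, vocab) for sentence in corpus]
```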

14
Q

Seq2Seq - training

A

Can take days to train
Pre-trained models are available → see example notebook
Public training datasets are available for specific translation tasks

15
Q

Seq2Seq - Hyperparameters

A

Batch_size
Optimizer_type (adam, sgd, rmsprop)
Learning_rate

Num_layers_encoder, num_layers_decoder

Can optimise on:

  • Accuracy
    – Vs provided validation dataset
  • BLEU score
    – Good for machine translation
    – Compares against multiple reference translations
  • Perplexity
    – Good for machine translation
    – Cross entropy
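
The link between perplexity and cross entropy can be sketched as follows, assuming cross entropy is measured in nats (natural log). The function names and the toy probabilities are illustrative assumptions.

```python
import math

def cross_entropy(probabilities_of_true_tokens):
    """Mean negative log-likelihood (in nats) the model gave the correct tokens."""
    n = len(probabilities_of_true_tokens)
    return -sum(math.log(p) for p in probabilities_of_true_tokens) / n

def perplexity(probabilities_of_true_tokens):
    """Perplexity is the exponential of the cross entropy."""
    return math.exp(cross_entropy(probabilities_of_true_tokens))

# A model that spreads its belief evenly over 4 candidate tokens at every step.
uniform_guess = [0.25, 0.25, 0.25, 0.25]
uniform_perplexity = perplexity(uniform_guess)
```

A perplexity close to 4 here reads as "the model is as uncertain as a uniform choice among 4 tokens"; lower is better.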
16
Q

Seq2Seq - Instances

A

Only GPU e.g. P3
Can only use a single machine for training → can use multiple GPUs on that machine, but training can’t be parallelised across multiple machines

17
Q

DeepAR - Model Type

A

Forecasting one-dimensional time-series data
- Allows you to train the same model over several related time series
- Finds frequencies and seasonality

Uses RNNs

18
Q

DeepAR - Input

A

JSON lines format → in GZIP or Parquet for better performance

Each record must contain:
start: the starting time stamp
target: the time series values

Each record can contain:
Dynamic features, e.g. whether a promotion was applied to a product, in a time series of product purchases
Categorical features
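
A minimal sketch of writing records in JSON lines form, one series per line. The `start` and `target` fields come from the card above; the `cat` and `dynamic_feat` field names are what I understand DeepAR to use for categorical and dynamic features, so treat them as an assumption to check against the docs. The values are made up.

```python
import json

records = [
    {
        "start": "2024-01-01 00:00:00",
        "target": [12.0, 15.0, 9.0, 20.0],
        "cat": [0],                      # e.g. a product category id
        "dynamic_feat": [[0, 0, 1, 1]],  # e.g. promotion on/off per time point
    },
    {
        "start": "2024-01-01 00:00:00",
        "target": [3.0, 4.0, 4.0, 6.0],
        "cat": [1],
        "dynamic_feat": [[0, 1, 1, 0]],
    },
]

def to_json_lines(series):
    """Serialise one JSON object per line, checking dynamic features line up."""
    for record in series:
        for feature in record.get("dynamic_feat", []):
            if len(feature) != len(record["target"]):
                raise ValueError("dynamic feature length must match target length")
    return "\n".join(json.dumps(record) for record in series)

json_lines = to_json_lines(records)
```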

19
Q

DeepAR - How is it used?

A

Always include the entire series for training, testing, and inference:
Use the entire dataset as the test set and remove the last time points for training → evaluate on the withheld values

Don’t use very large values for prediction length (>400) → can’t forecast too far into the future

Train on many time series and not just one when possible
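
The train/test convention above can be sketched like this, assuming the series is a plain list of values. `split_series` is a hypothetical helper, and `prediction_length` here simply means the number of trailing points withheld from training.

```python
def split_series(target, prediction_length):
    """Train on the series minus its last points; test on the whole series."""
    if not 0 < prediction_length < len(target):
        raise ValueError("prediction_length must be between 1 and len(target) - 1")
    train = target[:-prediction_length]  # last points removed
    test = list(target)                  # entire series kept for evaluation
    return train, test

series = list(range(10))
train, test = split_series(series, prediction_length=3)
```

The model is then scored on how well it forecasts the three points that appear in `test` but not in `train`.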

20
Q

DeepAR - Hyperparameters

A

Context_length = number of time points the model sees before making a prediction
Can be smaller than seasonalities → the model will lag one year anyhow

Epochs

Mini_batch_size

Learning_rate

Num_cells = number of neurons

21
Q

DeepAR - Instances

A

CPU or GPU
Single or multi machine
Recommendation: start with CPU (ml.c4.2xlarge then ml.c4.4xlarge)
Move up to GPU if necessary
with large mini-batch-size or with larger models
May need larger instances for tuning

22
Q

BlazingText - Model Type

A

Only for sentences → not entire documents

Text classification:
Predicts labels for a sentence
Useful in web searches, information retrieval
Supervised

Word2Vec:
Creates a vector representation of words
Semantically similar words are represented by vectors close to each other
This is called word embedding
It is useful in NLP, but is not an NLP algo itself
Used in machine translation, sentiment analysis

23
Q

BlazingText - Input

A

For supervised mode (text classification)
One sentence per line
First “word” in the sentence is the string __label__ followed by the label e.g. “__label__4 hello there this is a sentence”

Also “augmented manifest text format” –> json string
Source and label field

Word2Vec just wants a text file with one training sentence per line
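
A small sketch of producing supervised-mode lines in the `__label__` format described above. `to_blazingtext_line` is a hypothetical helper; lower-casing and whitespace tokenisation are my own assumptions about pre-processing, not requirements stated on the card.

```python
def to_blazingtext_line(label, sentence):
    """Prefix the sentence with __label__<label> and space-separate its tokens."""
    tokens = sentence.lower().split()
    return "__label__" + str(label) + " " + " ".join(tokens)

examples = [(4, "Hello there this is a sentence"), (1, "Another   short one")]
training_lines = [to_blazingtext_line(label, text) for label, text in examples]
```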

24
Q

BlazingText - modes of Word2Vec?

A

Word2vec has multiple modes:
CBow (continuous bag of words → order of words is thrown out, just the words themselves matter)
Skip-gram
Batch skip-gram → distributed computation over many CPU nodes
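
To make the difference between the modes concrete, here is a sketch of how training examples could be drawn from a sentence: skip-gram pairs each centre word with one context word, while CBOW predicts the centre word from its unordered context. The helper names and the `window_size` argument are illustrative assumptions, not BlazingText internals.

```python
def context_window(tokens, i, window_size):
    """Tokens within window_size positions of index i, excluding tokens[i]."""
    lo = max(0, i - window_size)
    hi = min(len(tokens), i + window_size + 1)
    return tokens[lo:i] + tokens[i + 1:hi]

def skipgram_pairs(tokens, window_size):
    """(centre, context) pairs: predict each nearby word from the centre word."""
    return [
        (centre, context)
        for i, centre in enumerate(tokens)
        for context in context_window(tokens, i, window_size)
    ]

def cbow_examples(tokens, window_size):
    """(context bag, centre) examples: sorting discards the word order."""
    return [
        (sorted(context_window(tokens, i, window_size)), centre)
        for i, centre in enumerate(tokens)
    ]

sentence = ["a", "b", "c"]
pairs = skipgram_pairs(sentence, window_size=1)
bags = cbow_examples(sentence, window_size=1)
```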

25
Q

BlazingText - Hyperparameters

A

Word2vec:
Mode (batch_skipgram, skipgram, cbow)
Learning_rate
Window_size
Vector_dim
Negative_samples

Text classification:
Epochs
Learning_rate
Word_ngrams
Vector_dim

26
Q

BlazingText - Instance types

A

Word2Vec:

For cbow and skipgram, recommend a single ml.p3.2xlarge –> any single CPU or GPU instance will work

For batch skipgram, can use single or multiple CPU instances

For text classification:

C5 is recommended for less than 2GB training data. For larger data set, use a single GPU instance (ml.p2.xlarge or ml.p3.2xlarge)

27
Q

Object2Vec - Model type

A

Word2vec → finds relationships between words in a sentence

Object2Vec → can work on entire document, or other objects

Creates low-dimensional dense embeddings of high-dimensional objects
- Compute nearest neighbour of objects
- Visualise clusters

Use cases:
Genre predictions
Recommendation (similar items or users)

28
Q

Object2Vec - Input

A

Data must be tokenised into integers

Training data consists of pairs of tokens and/or sequences of tokens:
Sentence-sentence
Labels-sentence (genre to description?)
Customer-customer
Product-product
User-user

29
Q

Object2Vec - How is it used?

A

Process data into JSON lines and shuffle it

Train with two input channels, two encoders, and a comparator

Encoder choices:
Average-pooled embeddings
CNN
Bidirectional LSTM

Comparator followed by a feed forward neural network

30
Q

Object2Vec - Hyperparameters

A

Usual deep learning:
Dropout
Early stopping
Epochs
Learning rate
Batch size
Layers
Activation fns
Optimiser
Weight decay

Enc1_network and enc2_network:
Choose cnn, bilstm, pooled_embedding → choose encoder type for each channel

31
Q

Object2Vec - Instances

A

Can train on only a single instance (CPU, GPU, or multi-GPU):
Start with CPU: ml.m5.2xlarge
Or GPU: ml.p2.xlarge
If needed go up to ml.m5.4xlarge, ml.m5.12xlarge
GPU: P2, P3, G4dn, G5

Inference: ml.p3.2xlarge
Use INFERENCE_PREFERRED_MODE env var to optimise for encoder embeddings rather than classification or regression

32
Q

Object Detection - Model Type

A

Identify all objects in an image with bounding boxes
Detects and classifies objects with a single deep neural network
Classes accompanied by confidence scores
Can train from scratch or use pre-trained models based on MXNet

33
Q

Object Detection - How is it used and types?

A

Two variants: MXNet and Tensorflow

Takes an image as input, outputs all instances of objects in the image, with categories and confidence scores

Mxnet:
Uses a CNN with single shot multibox detector (SSD) algo
Transfer learning model / incremental learning
Uses flip, rescale, and jitter internally to avoid overfitting

Tensorflow:
Uses ResNet, EfficientNet, MobileNet models

34
Q

Object Detection - Input

A

MXNet:
- recordIO or image format (jpg or png)

With image format, supply a json with annotation data for each image

35
Q

Object Detection - Hyperparameters

A

Batch size
Learning rate
Optimiser –> Sgd, adam, rmsprop

36
Q

Object Detection - Instance

A

Use GPU for training → can do multi-GPU and multi-machine
ml.p2, ml.p3, G4dn, and G5

Inference
CPU or GPU: M5, P2, P3, G4dn

37
Q

Image Classification - Model type

A

Object detection tells you where an object is
Image Classification tells you what is in the image

Assign one or more labels to an image

Doesn’t tell you where objects are, just what they are

38
Q

Image Classification - How is it used? Different Types?

A

Separate algos for mxnet and tensorflow

Mxnet:

  1. Full training mode
    –> Network initialised with random weights
  2. Transfer learning
    –> Pre-trained weights
    –> The top fully connected layer is initialised with random weights
    –> Network is fine tuned with new training data
    Default image size is 3-channel 224x224 (RGB)

Tensorflow
→ uses various tensorflow hub models (mobilenet, inception, resnet, efficientnet)
→ Top classification layer is available for fine tuning and further training

39
Q

Image Classification - Hyperparameters

A

Usual deep learning:
Batch size, learning rate, optimiser

Optimiser specific
Weight decay, beta 1, beta 2, eps, gamma
Slight difference between mxnet and tensorflow

40
Q

Image Classification - Instances

A

GPU for training (multi-GPU and multi-instance)
P2, p3, g4dn, g5

CPU or GPU for Inference
M5, p2, p3, g4dn, g5

41
Q

Semantic Segmentation - Model type

A

Pixel-level object classification:

Rather than just a bounding box
Shows you EXACTLY where the object is
Useful for self-driving vehicles, medical diagnostics, robot sensing
Produces a segmentation mask

42
Q

Semantic Segmentation - How is it used?

A

Built on mxnet and Gluon CV

Choice of 3 algos (decoders –> constructs segmentation mask):
Fully conv net (FCN)
Pyramid scene Parsing (PSP)
DeepLabV3

Choice of backbones (or encoder –> applies activation fn to features):
Resnet50, resnet101, both trained on imagenet

Incremental training, or training from scratch, both supported

43
Q

Semantic Segmentation -Training input

A

JPG images and PNG annotation
For both training and validation

Label maps to describe annotations

Augmented manifest image format supported for Pipe mode

44
Q

Semantic Segmentation - Inference Input

A

JPEG image

45
Q

Semantic Segmentation - Hyperparameters

A

Epochs, learning rate, batch size, optimiser etc
Algorithms
Backbones

46
Q

Semantic Segmentation - Instance

A

Only GPU for training: P2, P3, G4dn, G5
Only single instance

Inference: CPU (C5 or M5) or GPU (P3 or G4dn)

47
Q

Random cut forest - Model type

A

Anomaly detection:

Unsupervised

Detect unexpected spikes in time series
Breaks in periodicity
Unclassified data points
Assigns an anomaly score to each data point

Based on an algo AWS made

48
Q

Random cut forest - training input

A

RecordIO-protobuf or csv

Can use file or pipe mode

Optional test channel for computing accuracy, precision, recall, etc. → on data where you know where the anomalies are

49
Q

Random cut forest - how is it used?

A

Creates a forest of trees where each tree is a partition of the training data → looks at expected change in complexity of the tree as a result of adding a point to it

Data is sampled randomly

Then trained

RCF shows up in Kinesis Analytics as well → anomaly detection on streaming data

50
Q

Random cut forest - Hyperparams

A

Number of trees
Increase → reduces noise

Num samples per tree
Should be chosen such that 1/num_samples_per_tree approximates the ratio of anomalous to normal data
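
The sizing rule above is simple arithmetic; a sketch, assuming you have a rough estimate of the anomaly ratio. `num_samples_per_tree` here is a hypothetical helper named after the hyperparameter, and any limits the algorithm places on the value are not checked.

```python
def num_samples_per_tree(anomaly_ratio):
    """Pick a sample count so that 1 / count is close to the anomaly ratio."""
    if not 0 < anomaly_ratio < 1:
        raise ValueError("anomaly_ratio must be strictly between 0 and 1")
    return round(1 / anomaly_ratio)

# Roughly 0.5% of points are expected to be anomalous.
samples = num_samples_per_tree(0.005)
```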

51
Q

Random cut forest - Instances

A

No GPUs
Use m4, c4, c5 for training

C5 for inference

52
Q

Neural Topic Model - Model type

A

What is a document about?

Unsupervised

Neural variational inference
Organise documents into topics
Classify or summarise documents based on topics
–> Not just TF-IDF
–> Won’t return topic names, but will group docs

53
Q

Neural Topic Model - input

A

Four data channels:
“Train” is required
Validation, test and auxiliary are optional

Recordio-protobuf or csv

Words must be tokenized into integers

Every doc must contain a count for every word in the vocabulary in CSV

The auxiliary channel is the vocabulary, mapping tokens to words

File or pipe mode
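
A sketch of the "count for every word in the vocabulary" CSV layout described above, assuming documents are already tokenised into integer ids. `count_vector` and `to_csv_row` are hypothetical helper names.

```python
from collections import Counter

def count_vector(token_ids, vocab_size):
    """One count per vocabulary entry, including zeros for absent words."""
    counts = Counter(token_ids)
    return [counts.get(i, 0) for i in range(vocab_size)]

def to_csv_row(token_ids, vocab_size):
    """Render the count vector as a single comma-separated line."""
    return ",".join(str(c) for c in count_vector(token_ids, vocab_size))

# A document containing word 0 once, word 2 twice and word 4 once.
row = to_csv_row([0, 2, 2, 4], vocab_size=5)
```

Every row has the same width (the vocabulary size), which is why large vocabularies make CSV input bulky.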

54
Q

Neural Topic Model - How is it used?

A

You define how many topics you want
These topics are a latent representation based on top ranking words
Topics will not be human readable words
One of 2 topic modelling algos in SageMaker

55
Q

Neural Topic Model - Hyperparams

A

Batch size and learning rate:
Can reduce validation loss, at expense of training time

Num_topics

56
Q

Neural Topic Model - Instances

A

GPU or CPU
GPU for training
CPU adequate for inference

57
Q

LDA - how is it used?

A

Unsupervised: generates however many topics you specify

Optional test channels can be used for scoring results
Per word log likelihood shows how well it works

Functionality similar to NTM, but CPU based
Therefore much cheaper / efficient

58
Q

LDA - Model type

A

SageMaker’s other topic modelling algo

Latent Dirichlet Allocation

Unsupervised
Topics themselves are unlabeled → just groupings of documents with a shared subset of words
NTM is another unsupervised topic identification algo

Not deep learning

Can be used for things other than words:
Cluster customers based on purchases
Harmonic analysis in music

59
Q

LDA - Inputs

A

Train channel, optional test channel
RecordIO-protobuf or csv
Each document has counts for every word in vocab for that document
Pipe mode only supported with recordio

60
Q

LDA - Hyperparams

A

Num topics

Alpha0:
Initial guess for concentration parameter
Smaller values generate sparse topic mixtures
Larger values (>10) produce uniform mixtures

61
Q

LDA - Instances

A

Single CPU instance

62
Q

K-Nearest Neighbours KNN - Model Type

A

Simple classification or regression algo
Technically supervised → labelled

Classification:
Find the k closest points to a sample and return the most frequent label

Regression:
Find the k nearest neighbours and return the average value
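
Both rules can be sketched in a few lines of brute-force Python, assuming Euclidean distance and a training set held as (features, label) pairs. This is only an illustration of the idea; it does not include the sampling, dimensionality reduction or indexing that a production implementation would use.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """The k training items whose features are closest to the query point."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]

def knn_classify(train, query, k):
    """Most frequent label among the k nearest neighbours."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k):
    """Average value of the k nearest neighbours."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)

labelled = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
valued = [((0,), 1.0), ((1,), 3.0), ((10,), 100.0)]

predicted_label = knn_classify(labelled, (0.2, 0.3), k=3)
predicted_value = knn_regress(valued, (0.4,), k=2)
```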

63
Q

K-Nearest Neighbours KNN - How is it used

A

Data is first sampled

Sagemaker includes a dimensionality reduction stage:
Avoid sparse data (curse of dimensionality)
At cost of noise/accuracy
Options: sign or fjlt methods

Builds an index for looking at neighbours

Serialise the model

Query the model for a given k

64
Q

K-Nearest Neighbours KNN - Inputs

A

Train channel contains your data

Test channel emits accuracy or MSE

recordIO-protobuf or csv
Csv: first column contains label

Pipe or file mode

65
Q

K-Nearest Neighbours KNN - Hyperparams

A

K
Sample_size

66
Q

K-Nearest Neighbours KNN - Instances

A

Training on cpu or gpu:
M5 or p2

Inference:
CPU for lower latency
Gpu for higher throughput on large batches

67
Q

K-means - Model type

A

Unsupervised clustering

Divide data into k groups, where members of a group are as similar to each other as possible
You define what similar means
Measured by euclidean distance

Web-scale k-means clustering

68
Q

K-means - Input

A

Train channel, optional test:
Train ShardedByS3Key, test FullyReplicated

RecordIO-protobuf or CSV

File or Pipe mode

69
Q

K-means - How is it used?

A

Maps every observation to n-dimensional space (n = number of features)

Works to optimise the centres of K clusters
–> “Extra cluster centres” may be specified to improve accuracy (which end up getting reduced to k)
–> K=k*x → K is the initial number of clusters, want to reduce this down to k

Algorithm:
Determine initial cluster centres
- Random or k-means++ approach
- K-means++ tries to make initial clusters far apart
Iterate over training data and calculate cluster centres
Reduce clusters from K to k
- Using Lloyd’s method with k-means++
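
A minimal sketch of plain Lloyd iterations (assign each point to its nearest centre, then move each centre to the mean of its members), plus the within-cluster sum of squares used to judge tightness. It assumes the initial centres are supplied by the caller and leaves out k-means++ initialisation and the K-to-k reduction step; all function names are illustrative.

```python
import math

def assign(points, centres):
    """Index of the nearest centre for every point (Euclidean distance)."""
    return [
        min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
        for p in points
    ]

def update(points, assignments, centres):
    """Move each centre to the mean of the points assigned to it."""
    new_centres = []
    for c, old in enumerate(centres):
        members = [p for p, a in zip(points, assignments) if a == c]
        if not members:
            new_centres.append(old)  # keep an empty cluster's centre unchanged
            continue
        new_centres.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centres

def lloyd(points, centres, iterations=10):
    """Repeat the assign/update steps a fixed number of times."""
    for _ in range(iterations):
        centres = update(points, assign(points, centres), centres)
    return centres

def wcss(points, centres):
    """Within-cluster sum of squared distances to the nearest centre."""
    return sum(
        math.dist(p, centres[a]) ** 2
        for p, a in zip(points, assign(points, centres))
    )

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centres = lloyd(points, [(0.0, 0.0), (10.0, 2.0)])
tightness = wcss(points, centres)
```

Running this for several values of k and comparing `wcss` is one way to look for the point where extra clusters stop paying off.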

70
Q

K-means - Hyperparams

A

K:
Choosing k is tricky
Plot within-cluster sum of squares as a function of k
Use the “elbow method”
Basically optimise for tightness of clusters

Batch size
Extra centre factor
Init method

71
Q

K-means - instances

A

CPU or GPU, but cpu recommended
Only one GPU per instance is used → use g4dn if going to GPU
P2, P3, G4dn, and G5 supported

72
Q

PCA - Model type

A

Principal component analysis

Dimensionality reduction:
- Project higher-dimensionality data (lots of features) into lower dimensional space while minimising the loss of information
- The reduced dimensions are called components
- First component has largest possible variability
- Second component has the next largest
- Unsupervised

73
Q

PCA - How is it used?

A

Covariance matrix is created, then SVD (singular value decomposition)

Two modes:

Regular
–> For sparse data and moderate number of observations and features

Randomised
–> For large number of observations and features
Uses approximation algorithm

74
Q

PCA - Input

A

Recordio-protobuf or csv
File or pipe mode

75
Q

PCA - Hyperparams

A

Algorithm_mode
Subtract_mean → unbiases the data

76
Q

PCA - Instance

A

GPU or CPU
It depends on the specifics of the input data → need to experiment

77
Q

Factorisation machines - model type

A

Dealing with sparse data:
Click prediction (= individual user does not interact with majority pages on a website, but they do interact with a few pages)
Item recommendations
Since an individual user doesn’t interact with most pages / products the data is sparse

Supervised

Classification or regression

Limited to pair-wise interactions:
User → item for example
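
The pair-wise limit can be written down explicitly. This is the standard second-order factorisation machine model, given here as a reference sketch; the symbols ($w_0$ for the bias, $w_i$ for linear weights, $v_i$ for the factor vector of feature $i$, $n$ for the number of features) are notation introduced for this note, not taken from the cards.

```latex
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
           + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j
```

Only products of two features appear in the last term, and each interaction weight is a dot product of two learned factor vectors, which is what lets the model cope with sparse user-item data.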

78
Q

Factorisation machines - Input

A

RecordIO-protobuf format with float32

Sparse data means csv isn’t practical → loads of commas

79
Q

Factorisation machines - How is it used?

A

Find factors we can use to predict a classification (click or not? Purchase or not?) or value (predicted rating?) given a matrix representing some pair of things

Usually used in the context of recommender systems

80
Q

Factorisation machines - hyperparams

A

Initialisation methods for bias, factors and linear terms
- Uniform, normal, or constant
- Can tune properties of each method

81
Q

Factorisation machines - Instances

A

Cpu or gpu
Cpu recommended
Gpu only works with dense data

82
Q

IP Insights - Model type

A

Unsupervised learning of ip address usage patterns

Identifies suspicious behaviour from ip address
Identify logins from suspicious ip addresses
Identify accounts creating resources from anomalous ips

83
Q

IP Insights - Input

A

User names and account IDs can be fed in directly, no need to preprocess
Training channel, optional validation (computes AUC score)
CSV only → entity, IP

84
Q

IP Insights - How is it used?

A

Uses a neural network to learn latent vector representation of entities and ip addresses

Entities are hashed and embedded:
Need sufficiently large hash size

Automatically generates negative samples during training by randomly pairing entities and ips

85
Q

IP Insights - Hyperparams

A

Num entity vectors:
Hash size
Set to twice the number of unique entity identifiers
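
A sketch of the "twice the number of unique entity identifiers" rule, assuming the training data is CSV text with the entity in the first column and the IP address in the second. `suggested_num_entity_vectors` is a hypothetical helper and the sample rows are made up (documentation-range IP addresses).

```python
import csv
import io

def suggested_num_entity_vectors(csv_text):
    """Twice the number of distinct entities found in the first CSV column."""
    reader = csv.reader(io.StringIO(csv_text))
    entities = {row[0] for row in reader if row}
    return 2 * len(entities)

sample = "alice,192.0.2.10\nbob,198.51.100.7\nalice,203.0.113.5\n"
hash_size = suggested_num_entity_vectors(sample)
```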

Vector dim:
Size of embedding vectors
Scales model size
Too large results in overfitting

Epochs, learning rate, batch size etc

86
Q

IP Insights - Instances

A

CPU or GPU
Gpu recommended e.g. p3 or higher
Can use multiple GPUs
Size of CPU depends on vector dim and number of vectors