Modeling 2 Flashcards
Object Detection
it detects objects in an image with bounding boxes
How does Object Detection work?
with a single deep neural network
a CNN using the Single Shot multibox Detector (SSD) algorithm
- CNN can be VGG-16 or ResNet-50
how does Object Detection provide confidence?
using a confidence score
how to train object detection?
i. train from scratch
ii. use pre-trained models based on ImageNet
Object Detection input?
RecordIO / image format (JPG, PNG)
for training images
JSON to provide metadata like bounding boxes and labels per image
Object Detection output
all instances of objects in the image with categories and confidence scores
Object Detection transfer learning mode
use pre-trained model for the base network weights, instead of random initial weights
how does Object Detection avoid overfitting?
flip
rescale
jitter
Object Detection hyperparameters
usual ones in a CNN
mini_batch_size
learning_rate
optimizer
- sgd, adam, rmsprop, adadelta
Object Detection instance types
GPU instances for training (it's a demanding CNN)
multi-GPU and multi-machine training supported (scales up nicely)
ml.p2.xlarge
ml.p2.8xlarge
ml.p2.16xlarge
ml.p3.2xlarge
ml.p3.8xlarge
ml.p3.16xlarge
for inference:
CPU or GPU
C5, M5, P2, P3
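A minimal sketch of launching an Object Detection training job, assuming the SageMaker Python SDK v2; the IAM role ARN, S3 paths and hyperparameter values are placeholders:

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    sess = sagemaker.Session()
    container = image_uris.retrieve("object-detection", sess.boto_region_name)

    od = Estimator(
        image_uri=container,
        role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder IAM role
        instance_count=1,
        instance_type="ml.p3.2xlarge",            # GPU instance for the demanding CNN
        output_path="s3://my-bucket/od/output",   # placeholder bucket
        sagemaker_session=sess,
    )
    od.set_hyperparameters(
        num_classes=2,                # required
        num_training_samples=1000,    # required
        base_network="resnet-50",     # or "vgg-16"
        use_pretrained_model=1,       # transfer learning mode
        mini_batch_size=16,
        learning_rate=0.001,
        optimizer="sgd",              # or adam, rmsprop, adadelta
        epochs=30,
    )
    od.fit({"train": "s3://my-bucket/od/train",
            "validation": "s3://my-bucket/od/validation"})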
Image Classification
like Object detection but simpler
doesn't tell you where objects are, but gives you label(s) for the image
Image Classification Input
Apache MXNet RecordIO
- not protobuf
- for interoperability with other deep learning frameworks
Raw jpg, png images
image format requires .lst files that associate, for each image:
- image index
- class label
- path to image
augmented manifest image format enables pipe mode
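Illustrative .lst contents (tab-separated: image index, class label, relative path to the image); the file names below are made up:

    0    1    images/cat_001.jpg
    1    0    images/dog_001.jpg
    2    1    images/cat_002.jpg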
what is pipe mode?
allows you to stream data from S3 instead of copying it over first
How does Image Classification work?
ResNet CNN
full training mode:
- network initialized with random weights
Transfer Learning mode:
- initialized with pre-trained weights
- top fully-connected layer is initialized with random weights
- network is fine-tuned with new training data
Default image specifications for Image Classification
224x224
3-channel
(ImageNet's dataset)
Image Classification hyperparameters
batch size
learning rate
optimizer
optimizer-specific parameters
- weight decay
- beta 1
- beta 2
- eps
- gamma
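A minimal sketch of setting these hyperparameters with the SageMaker Python SDK; the role ARN, region and values are placeholders:

    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    ic = Estimator(
        image_uri=image_uris.retrieve("image-classification", "us-east-1"),
        role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
        instance_count=1,
        instance_type="ml.p3.2xlarge",
    )
    ic.set_hyperparameters(
        num_classes=10,               # required
        num_training_samples=50000,   # required
        use_pretrained_model=1,       # transfer learning mode
        image_shape="3,224,224",      # channels,height,width (ImageNet-style default)
        mini_batch_size=32,
        learning_rate=0.01,
        optimizer="adam",
        beta_1=0.9, beta_2=0.999, eps=1e-8,   # adam-specific
        weight_decay=0.0001,
    )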
Image Classification instance types
GPU instances for training (P2, P3); multi-GPU and multi-machine supported
GPU or CPU for inference (C4, P2, P3)
Semantic Segmentation
Pixel level object classification
not like object detection with bounding boxes
not like image classification with labels
Semantic Segmentation use cases
self-driving vehicles
medical imaging diagnosis
robot sensing
Semantic segmentation output
segmentation mask
Semantic Segmentation training
JPG images and PNG annotation masks
label maps to describe annotations
- for training and validation
augmented manifest image format supported for pipe mode
jpg images accepted for inference
Semantic Segmentation is built on
MXNet Gluon and GluonCV
Semantic Segmentation algorithms
Fully-Convolutional Network (FCN)
Pyramid Scene Parsing (PSP)
DeepLabV3
Choices of backbones for Semantic Segmentation
ResNet50
ResNet101
Both trained on ImageNet
Semantic Segmentation training from scratch or incremental
both are supported
Semantic Segmentation hyperparameters
epochs, learning rate, batch size, optimizer, algorithm, backbone
Semantic Segmentation instance types
GPU only: P2, P3
Single Machine Only
ml.p2.xlarge
ml.p2.8xlarge
ml.p2.16xlarge
ml.p3.8xlarge
ml.p3.16xlarge
Inference instances for Semantic Segmentation
CPU C5, M5
GPU P2, P3
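A minimal sketch of a Semantic Segmentation training job, assuming the SageMaker Python SDK v2; the role, S3 paths and hyperparameter values are placeholders, and the channel names assume the usual image/annotation layout:

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    sess = sagemaker.Session()
    ss = Estimator(
        image_uri=image_uris.retrieve("semantic-segmentation", sess.boto_region_name),
        role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
        instance_count=1,                 # single machine only
        instance_type="ml.p2.xlarge",     # GPU only for training
        sagemaker_session=sess,
    )
    ss.set_hyperparameters(
        algorithm="fcn",              # or "psp", "deeplab"
        backbone="resnet-50",         # or "resnet-101"
        use_pretrained_model="True",
        num_classes=21,
        num_training_samples=1000,
        epochs=10,
        learning_rate=0.0001,
    )
    ss.fit({
        "train": "s3://my-bucket/ss/train",                        # JPG images
        "validation": "s3://my-bucket/ss/validation",
        "train_annotation": "s3://my-bucket/ss/train_annotation",  # PNG masks
        "validation_annotation": "s3://my-bucket/ss/validation_annotation",
    })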
Random cut forest
anomaly detection
unsupervised
detect unexpected spikes in time series data
breaks in periodicity
unclassifiable data points
based on an algorithm developed by Amazon
random cut forest output
assigns an anomaly score to each data point
random cut forest training input
RecordIO-protobuf or CSV
can use file or pipe mode on either
optional test channel for computing accuracy, precision, recall and F1 on labeled data
How does random cut forest work?
creates a forest of trees
where each tree is a partition of the training data
looks at expected change in complexity of the tree as a result of adding a point to it
how is data sampled in random cut forest?
data is randomly sampled, then each tree is trained on its own sample
is it possible to use random cut forest in Kinesis Analytics?
yes - it can work on streaming data too
random cut forest hyperparameters
num_trees
- increasing reduces noise
num_samples_per_tree
- 1/num_samples_per_tree approximates the ratio of anomalous to normal data
Random cut forest instance types
does not use GPU
use M4, C4, C5 for training
ml.c5.xlarge for inference
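A minimal sketch using the SDK's RandomCutForest estimator class; the role ARN is a placeholder and the data is a random toy series:

    import numpy as np
    from sagemaker import RandomCutForest

    rcf = RandomCutForest(
        role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
        instance_count=1,
        instance_type="ml.m4.xlarge",   # CPU -- RCF does not use GPUs
        num_trees=100,                  # more trees -> less noise in anomaly scores
        num_samples_per_tree=256,       # ~1/256 of the data assumed anomalous
    )

    series = np.random.rand(10000, 1).astype("float32")  # toy 1-feature time series
    rcf.fit(rcf.record_set(series))    # converts the array to RecordIO-protobuf in S3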
Neural Topic Modeling
organize documents into topics
classify or summarize documents based on topics
not just TF-IDF
unsupervised
Neural Topic Modeling algorithm
Neural Variational Inference
Training input for Neural Topic Modeling
Four data channels
- train is required
- validation, test and auxiliary are optional
recordIO-protobuf or CSV
words must be tokenized into integers
every document must contain a count for every word in the vocabulary in CSV
the auxiliary channel is for the vocabulary
file or pipe mode (pipe is faster)
how to use Neural Topic Modeling
you define how many topics you want
does the Neural Topic Modeling give us topic names ?
No, topics are a latent representation based on top ranking words
one of two topic modeling algorithms in SageMaker - you can try them both
Neural topic model
important hyperparameters
lowering mini_batch_size and learning_rate can reduce validation loss, at the expense of training time
num_topics
Neural Topic Modeling instance types
GPU or CPU
GPU recommended for training
CPU (cheaper) is OK for inference
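A minimal sketch with the SDK's NTM estimator class; the role ARN is a placeholder and the bag-of-words matrix is random toy data with a made-up vocabulary size:

    import numpy as np
    from sagemaker import NTM

    ntm = NTM(
        role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
        instance_count=1,
        instance_type="ml.p3.2xlarge",   # GPU recommended for training
        num_topics=20,                   # you choose the number of latent topics
    )

    # toy corpus: 1000 documents x 5000-word vocabulary, as word counts
    docs = np.random.randint(0, 5, size=(1000, 5000)).astype("float32")
    ntm.fit(ntm.record_set(docs))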
Latent Dirichlet Allocation (LDA)
topic modeling not based on Deep Learning
unsupervised
- topics are unlabeled, which means they are just groupings of documents with a shared subset of words
can be used for things other than words
how can you use LDA for things other than words ?
cluster customers based on purchases
harmonic analysis in music
LDA input for training
Train Channel, Optional Test Channel
RecordIO-protobuf or CSV
Each doc has counts for every word in vocabulary (CSV)
pipe mode only supported with RecordIO
LDA:
un/supervised?
unsupervised
LDA:
optional test channel can be used for … ?
Scoring results
- per-word log likelihood
LDA vs NTM
similar to NTM but CPU based
- therefore cheaper / more efficient
LDA hyperparameters
num_topics
alpha0
- initial guess for concentration parameter
- smaller values generate sparse topic mixtures
- larger values (>1.0) produce uniform mixture
LDA instance type
single CPU instance
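A minimal sketch using the generic Estimator with the LDA container, assuming the SageMaker Python SDK v2; the role, paths, vocabulary size and document count are placeholders:

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    sess = sagemaker.Session()
    lda = Estimator(
        image_uri=image_uris.retrieve("lda", sess.boto_region_name),
        role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
        instance_count=1,                 # single CPU instance
        instance_type="ml.c5.2xlarge",
        sagemaker_session=sess,
    )
    lda.set_hyperparameters(
        num_topics=20,
        feature_dim=5000,        # vocabulary size
        mini_batch_size=1000,    # total number of documents
        alpha0=0.1,              # <1.0 -> sparser topic mixtures
    )
    lda.fit({"train": TrainingInput("s3://my-bucket/lda/train", content_type="text/csv")})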
KNN
K-Nearest-Neighbors
Simple Classification or regression algorithm
supervised
KNN Classification
find the K closest points to a sample point and return the most frequent label
KNN Regression
Find the K closest points to a sample point and return the average value
KNN input
Training channel, contains data
Test channel, emits accuracy or MSE
RecordIO-protobuf or CSV training
- first column is label
File or Pipe mode, either
KNN in SageMaker, how does it work?
1- Data is sampled
2- SageMaker includes a dimensionality reduction stage
- avoid sparse data (Curse of dimensionality)
- at cost of noise / accuracy
- sign or fjlt methods
3- builds an index for looking up neighbours
4- serialize the model
5- query the model for a given K
KNN hyperparameters
K!
Sample_size
KNN Instance types
Training on CPU or GPU
- ml.m5.2xlarge
- ml.p2.xlarge
Inference
- CPU for lower latency
- GPU for higher throughput on large batches
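A minimal sketch with the SDK's KNN estimator class; the role ARN is a placeholder and the features/labels are random toy data; the dimension reduction arguments correspond to the "sign"/"fjlt" stage mentioned above:

    import numpy as np
    from sagemaker import KNN

    knn = KNN(
        role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
        instance_count=1,
        instance_type="ml.m5.2xlarge",
        k=10,                            # the K in K-nearest-neighbors
        sample_size=5000,                # how many training points to sample
        predictor_type="classifier",     # or "regressor"
        dimension_reduction_type="sign", # optional; "sign" or "fjlt"
        dimension_reduction_target=64,   # reduced dimensionality (< feature_dim)
    )

    X = np.random.rand(5000, 128).astype("float32")            # toy features
    y = np.random.randint(0, 2, size=5000).astype("float32")   # toy labels
    knn.fit(knn.record_set(X, labels=y))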
K-Means
unsupervised clustering
divide the data into K groups where members of a group are as similar as possible to each other
- you define similar
- measured by Euclidean distance
SageMaker offers web-scale k-means clustering
K-Means input
training channel
optional test
- train ShardedByS3Key,
- test FullyReplicated
RecordIO-protobuf or CSV
File or Pipe on either
K-Means under the hood
every observation mapped to n-dimensional space
n is number of features
works to optimize the center of K clusters
"extra cluster centers" may be specified to improve accuracy (they end up getting reduced to k)
K = k * x, where x is the extra_center_factor
K-Means algorithm
Determine initial cluster centers
- random or k-means++ approach
- k-means++ tries to make the initial clusters far apart
Iterate over training data and calculate cluster centers
Reduce clusters from K to k
- using Lloyd’s method with k-means++
K-Means hyperparameters
K!
- choosing k is tricky
- plot within-cluster sum of squares as function of K
- elbow method
- basically optimize for tightness of clusters
mini_batch_size
extra_center_factor
Init_method
K-Means instance types
CPU or GPU, but CPU recommended
only one GPU per instance is used when training on GPU
- use a p*.xlarge if you go GPU
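A minimal sketch with the SDK's KMeans estimator class; the role ARN, bucket and data are placeholders; k would normally be chosen via the elbow method described above:

    import numpy as np
    from sagemaker import KMeans

    kmeans = KMeans(
        role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
        instance_count=1,
        instance_type="ml.c5.xlarge",      # CPU recommended
        k=10,                              # try several values, plot WCSS, pick the elbow
        init_method="kmeans++",            # spread the initial centers far apart
        output_path="s3://my-bucket/kmeans/output",  # placeholder
    )

    X = np.random.rand(10000, 50).astype("float32")  # toy observations, 50 features
    kmeans.fit(kmeans.record_set(X))   # converts to RecordIO-protobuf in S3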
PCA
Principal Component Analysis
Dimensionality Reduction
avoid the curse of dimensionality
while minimizing loss of information
PCA
un/supervised?
unsupervised
what are the reduced dimensions called?
Components
first component has largest possible variability
second component has the next largest
PCA input
recordIO-protobuf or CSV
File or Pipe on either
PCA under the hood?
Covariance matrix is created
then singular value decomposition (SVD)
Two modes
- regular
for sparse data and a moderate number of observations and features
- randomized
for large number of observations and features
uses approximation algorithm
PCA hyperparameters
Algorithm_mode
Subtract_mean
- unbiases the data
PCA instance type
CPU or GPU
- it depends on the specifics of the input data
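A minimal sketch with the SDK's PCA estimator class; the role ARN is a placeholder and the input is random toy data:

    import numpy as np
    from sagemaker import PCA

    pca = PCA(
        role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
        instance_count=1,
        instance_type="ml.c5.xlarge",
        num_components=10,            # how many principal components to keep
        algorithm_mode="randomized",  # "regular" for smaller data, "randomized" for large
        subtract_mean=True,           # unbias the data first
    )

    X = np.random.rand(50000, 200).astype("float32")  # toy high-dimensional data
    pca.fit(pca.record_set(X))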
Factorization Machines
Classification and regression
Dealing with Sparse data
Factorization Machines use cases
Click Prediction
Item Recommendations
Since an individual user doesn’t interact with most pages / products the data is sparse
Factorization Machines
un/supervised?
supervised
- Classification or Regression
is it limited to pair-wise interactions?
yes. e.g. user - item
Factorization Machines input
recordIO-protobuf with Float32
- Sparse data means CSV isn’t practical
Factorization Machines, how does it work ?
Finds factors we can use to predict a classification
e.g. Click or not / Purchase or not
or value (predicted rating?)
given a matrix representing some pair of things
(users and items)
usually used in the context of recommender systems
Factorization Machines hyperparameters
initialization methods for bias, factors, and linear terms
- uniform, normal or constant
- can tune properties of each method
Factorization Machines instance types
CPU or GPU
CPU recommended
GPU only works for dense data
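A minimal sketch with the SDK's FactorizationMachines estimator class; the role ARN and the toy click data are placeholders; record_set handles the Float32 RecordIO-protobuf conversion:

    import numpy as np
    from sagemaker import FactorizationMachines

    fm = FactorizationMachines(
        role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
        instance_count=1,
        instance_type="ml.c5.xlarge",       # CPU recommended
        num_factors=64,                     # dimensionality of the factorization
        predictor_type="binary_classifier", # e.g. click / no-click; or "regressor"
    )

    # toy user-item interaction features (e.g. one-hot user + one-hot item) and click labels
    X = np.random.rand(10000, 500).astype("float32")
    y = np.random.randint(0, 2, size=10000).astype("float32")
    fm.fit(fm.record_set(X, labels=y))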
IP Insights
finding fishy behaviour
identify suspicious behaviour from IP addresses
identify logins from anomalous IPs
identify accounts creating resources from anomalous IPs
IP insights
un/supervised
unsupervised
IP insights input
usernames, account IDs (raw data; no need to pre-process)
training channel, optional validation (computes AUC scores)
CSV only (Entity, IP)
IP Insights, how does it work?
uses a neural network to learn latent vector representations of entities and IP addresses
entities are hashed and embedded
- need a sufficiently large hash size
automatically generates negative samples during training by randomly pairing entities and IPs
IP Insights hyperparameters
num_entity_vectors
- hash size
- set to twice the number of unique entity identifiers
Vector_dim
- size of embedding vectors
- scales model size
- too large results in overfitting
Epochs, Learning rate, batch size, etc.
IP Insights instance type
CPU or GPU
GPU recommended for training: ml.p3.2xlarge or higher
can use multiple GPUs
size of CPU instance depends on
- vector_dim
- num_entity_vectors
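A minimal sketch using the generic Estimator with the IP Insights container, assuming the SageMaker Python SDK v2; the role, S3 paths and hyperparameter values are placeholders, and the training data is assumed to be a headerless CSV of (entity, IP) pairs:

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    sess = sagemaker.Session()
    ipi = Estimator(
        image_uri=image_uris.retrieve("ipinsights", sess.boto_region_name),
        role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
        instance_count=1,
        instance_type="ml.p3.2xlarge",   # GPU recommended for training
        sagemaker_session=sess,
    )
    ipi.set_hyperparameters(
        num_entity_vectors=20000,  # hash size: ~2x the number of unique entities
        vector_dim=128,            # embedding size; too large can overfit
        epochs=10,
        learning_rate=0.001,
        batch_size=1000,
    )
    ipi.fit({
        "train": TrainingInput("s3://my-bucket/ipinsights/train.csv", content_type="text/csv"),
        "validation": TrainingInput("s3://my-bucket/ipinsights/valid.csv", content_type="text/csv"),
    })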