Modeling Flashcards
Types of Neural Networks
Feedforward
Convolutional Neural Network
Recurrent Neural Network
Convolutional Neural Network (CNN)
Image Classification
Recurrent Neural Network
for sequences
e.g. Stock Prices, Words in a sentence…
- LSTM, GRU
LSTM full form
Long Short Term Memory
GRU full form
Gated Recurrent Unit
What if features are location-invariant (they could appear anywhere in the input)?
e.g. we're not sure where the sign is in our image; in that case, use a CNN
adversarial example
An adversarial example is an instance with small, intentional feature perturbations that cause a machine learning model to make a false prediction.
another CNN example: Sentiment Analysis
MaxPooling1D
MaxPooling2D
MaxPooling3D
distill the input down to the bare essence of what you need to analyze
Conv1D
Conv2D
Conv3D
these layer types do the actual convolution
1D like text
2D like Images
3D like 3D volumetric data
Typical Image process using CNN. what’s the process?
Conv2D:
- does the actual convolution
MaxPooling2D:
- distills down and shrinks the image
Dropout:
- prevents overfitting
Flatten:
- flattens the data to feed it into a perceptron
Dense:
- hidden layer of neurons (the perceptron)
Dropout:
- again, to prevent overfitting
Softmax:
- chooses the final classification that comes out of the neural network
(see the sketch below)
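A minimal Keras sketch of this pipeline; the layer sizes, input shape, and 10-class output are illustrative assumptions, not from the flashcards:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# Illustrative CNN pipeline: Conv2D -> MaxPooling2D -> Dropout -> Flatten -> Dense -> Dropout -> Softmax
model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=(28, 28, 1)),  # does the convolution
    MaxPooling2D(pool_size=(2, 2)),   # distills/shrinks the feature maps
    Dropout(0.25),                    # prevents overfitting
    Flatten(),                        # flatten to feed into dense layers
    Dense(128, activation="relu"),    # hidden layer of neurons
    Dropout(0.5),
    Dense(10, activation="softmax"),  # choose the final classification
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
```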
Name some specialized CNN architectures
LeNet-5:
- handwriting recognition
AlexNet:
- Image Classification, Deeper than LeNet
GoogLeNet:
- even deeper, but with better performance
- it uses Inception Modules (groups of convolution layers)
ResNet:
- Residual Network, even deeper but maintains performance using Skip Connections
Recurrent Neural Network (RNN) topologies
Where sequence matters
- Sequence to Sequence
- Sequence to Vector
- Vector to sequence
- Encoder -> Decoder
Sequence to Sequence NN
input: a time series
output: a time series
e.g. Stock prices
Sequence to Vector NN
e.g. Words in a sentence to sentiments
Vector to Sequence NN
e.g. produce a caption from an image
Encoder Decoder
e.g. Sequence to Vector to Sequence
capture the words of a French sentence into a vector, then expand that vector into an English sentence
Training RNN
Backpropagation runs both through the neural network and back through time
Really hard
Sensitive to hyperparameters
Resource intensive
LSTM
maintains both long term and short term states
GRU
Gated Recurrent Unit
Simplified LSTM
What if you make some wrong choices when training an RNN?
it might lead to an RNN that doesn't converge at all
AWS offers for Training a neural network?
Apache MXNet on EMR
P2, P3, G Instance types
Deep Learning AMI
Major Components of Tuning a Neural Network? (hyperparameters)
Some knobs and dials:
- Learning Rate
- Batch size
- epochs
Learning Rate
training uses Gradient Descent (or similar means); the learning rate controls how big each step is
Too high LR:
- overshoot the optimal solution
Too Low LR:
- take too long to find the optimal solution
Batch Size
Small batch sizes can work their way out of local minima more easily
Large Batch sizes can end up getting stuck in the wrong solution
Random shuffling at each epoch can make this look like very inconsistent results from run to run
Learning Rate and Training
Small LR will increase the training time
Large LR can overshoot the correct solution
Regularization Techniques. What do they do?
they prevent overfitting
What if you are overfitting?
- try a simpler model
- try fewer neurons
- try fewer layers
Dropout:
- remove some neurons at random at each training step, to force the model to spread out its learning
Early Stopping:
- stop at the point where training accuracy keeps improving but validation accuracy does not (see the sketch below)
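A hedged Keras example combining Dropout and early stopping; layer sizes, the patience value, and the data placeholders are assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    Dense(64, activation="relu", input_shape=(20,)),
    Dropout(0.5),                        # randomly drops neurons at each training step
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Stop when validation accuracy stops improving, even if training accuracy keeps rising
early_stop = EarlyStopping(monitor="val_accuracy", patience=3, restore_best_weights=True)

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[early_stop])   # X_train, y_train, X_val, y_val are placeholders
```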
Vanishing Gradient Problem
Opposite of Exploding Gradients
Vanishing Gradient is when the slope of the learning curve approaches zero
Addressing Vanishing Gradient Problem
Multi-level hierarchy
- train sub-networks instead of the whole network
LSTM
Residual Network
- ResNet, for object recognition
Better choices of Activation Function
- ReLU
Gradient Checking
a debugging technique
Numerically check the derivatives computed during training
Useful for validating code of neural network
L1 and L2 Regularization
L1: the penalty term is the sum of the absolute values of the weights
L2: the penalty term is the sum of the squares of the weights
both are used to prevent overfitting
L1 and L2 differences?
L1: Sum of weights
- performs feature selection
- Computationally inefficient
- sparse output
L2: Sum of square of weights
- all features remain considered, just weighted
- computationally efficient
- Dense output
Why choose L1 over L2?
Feature selection reduces dimensionality
- out of 100 features, maybe only 10 end up with non-zero coefficients
- the resulting sparsity can make up for its computational inefficiency
on the other hand, if you think all of the features are important, go for L2 (see the sketch below)
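A small Keras illustration of attaching an L1 vs. an L2 penalty to a layer's weights (the penalty strengths are arbitrary examples):

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras import regularizers

# L1 penalty (sum of |w|): tends to drive some weights to exactly zero -> sparse, feature selection
l1_layer = Dense(64, activation="relu", kernel_regularizer=regularizers.l1(0.01))

# L2 penalty (sum of w^2): shrinks all weights but keeps them non-zero -> dense output
l2_layer = Dense(64, activation="relu", kernel_regularizer=regularizers.l2(0.01))
```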
Confusion Matrix
True Positives / True Negatives
False Positives / False Negatives
Predicted Yes, Actual Yes
- True Positive
Predicted Yes, Actual No
- False Positive
Predicted No, Actual Yes
- False Negative
Predicted No, Actual No
- True Negative
Multi-class confusion matrix
adding a heat map makes it easier to read for multi-class classification
Precision
TP / (TP + FP)
of everything nominated as positive, how much was actually positive
AKA
- Percent of relevant results
- Correct Positives
use when false positives are costly
- e.g. medical screening, drug testing
Recall
TP / (TP + FN)
of everything that was actually positive, how much did we capture
AKA:
- Sensitivity, True Positive rate, Completeness
- % of actual positives correctly predicted
use when false negatives are costly
- e.g. Fraud Detection
F1
2TP / (2TP + FP + FN)
= 2 * (Precision * Recall) / (Precision + Recall)
Harmonic mean of precision and recall (sensitivity)
use when you care about both precision and recall
Specificity
TN / (TN + FP)
True Negative rate
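As a quick sanity check on these formulas, a plain-Python example with made-up confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts
TP, FP, FN, TN = 80, 10, 20, 90

precision   = TP / (TP + FP)                        # 0.889, percent of relevant results
recall      = TP / (TP + FN)                        # 0.80, sensitivity / true positive rate
specificity = TN / (TN + FP)                        # 0.90, true negative rate
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, ~0.842 (same as 2TP / (2TP + FP + FN))

print(precision, recall, specificity, f1)
```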
RMSE
Root mean squared error
Accuracy measurement
Only cares about right and wrong answers
ROC
Receiver Operating Characteristic Curve
plots the true positive rate vs. the false positive rate at various threshold settings
points above the diagonal represent good classification (better than random)
the more the curve bends toward the upper-left, the better (see the sketch below)
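A hedged scikit-learn sketch for plotting an ROC curve and its AUC (the labels and scores are placeholder arrays):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 1, 0]                    # placeholder labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]   # placeholder model scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")   # bends toward the upper-left for a good model
plt.plot([0, 1], [0, 1], linestyle="--")       # diagonal = random guessing
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```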
Ensemble Learning
An ensemble model takes multiple models (they might just be variations of the same model) and lets them all vote on the final result
Bagging and Boosting
Are Decision Trees prone to overfitting?
yes they are!
Bagging
Generate multiple training sets by random sampling with replacement
each resampled model can be trained in parallel
the ensemble ends up more robust than any single model
Boosting
works in a serial manner vs the parallel bagging
it assigns a weight to each observation in the dataset
training is sequential, starting with equal weights for every observation; subsequent models boost the weights of the observations the previous models got wrong
Bagging vs Boosting
XGBoost is one of the hottest algorithms today
XGBoost’s strength
Accuracy
What is Bagging good for?
avoid overfitting
having a regularization effect
Bagging is easier to parallelize
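A hedged scikit-learn sketch contrasting the two approaches on placeholder data (the dataset and estimator counts are arbitrary; BaggingClassifier's default base estimator is a decision tree):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Bagging: resample with replacement, train trees in parallel, let them vote
bagging = BaggingClassifier(n_estimators=50, n_jobs=-1)

# Boosting: trees are built sequentially, each one correcting the previous trees' errors
boosting = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1)

bagging.fit(X, y)
boosting.fit(X, y)
```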
What's an ideal file format for SageMaker to fetch training data from S3?
RecordIO-wrapped Protobuf
Can you use SageMaker within Spark?
yes you can
SageMaker Neo
to deploy to Edge Devices
SageMaker Linear Learner
- linear regression: numeric prediction
- classification (binary / multi-class)
Linear Learner input format
Performant options:
- RecordIO-wrapped protobuf (float32 only)
CSV
- First column is assumed to be the label
File or Pipe mode both supported
File vs Pipe mode
File mode copies all the data to the training fleet
Pipe mode streams only the data it needs, which is why it's more efficient
Linear Learner Preprocessing
Normalize
- so all features are weighted the same
- Linear Learner can optionally do this for you
Shuffle the data
Linear Learner Training
Uses SGD
- optimization algorithms: Adam, AdaGrad, SGD, etc
Multiple models are optimized in parallel
Tune L1, L2 regularization
Linear Learner Validation
most optimal model is selected
Linear Learner Hyperparameters
Balance_multiclass_weights - gives each class equal importance in loss functions
Learning_rate, mini_batch_size
L1
- Regularization
Wd
- Weight decay (L2 regularization)
Linear Learner Instance types
Single or multi-machine CPU/GPU
Multi-GPU does not help in this case
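A minimal sketch of training Linear Learner from a notebook; it assumes the SageMaker Python SDK v2's LinearLearner estimator, and the role ARN and toy data are placeholders. record_set() handles the RecordIO-wrapped protobuf conversion and float32 requirement mentioned above:

```python
import numpy as np
from sagemaker.amazon.linear_learner import LinearLearner

role = "arn:aws:iam::123456789012:role/MySageMakerRole"   # hypothetical role ARN
X = np.random.rand(1000, 10).astype("float32")            # features must be float32
y = np.random.randint(0, 2, 1000).astype("float32")       # toy binary labels

ll = LinearLearner(
    role=role,
    instance_count=1,
    instance_type="ml.c4.xlarge",
    predictor_type="binary_classifier",
    learning_rate=0.01,
    l1=0.0,     # L1 regularization
    wd=0.01,    # weight decay = L2 regularization
)

# record_set() converts the numpy arrays to RecordIO-wrapped protobuf and uploads them to S3
train_records = ll.record_set(X, labels=y, channel="train")
ll.fit(train_records, mini_batch_size=100)
```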
XGBoost
eXtreme Gradient Boosting
- Boosted group of decision trees
- New trees made to correct the errors of previous trees
- Uses gradient descent to minimize loss as new trees are added
XGBoost industry trend
on Kaggle it's the talk of the town
and it's also fast (not very resource intensive)
XGBoost is for Classification or Regression?
Both
it does regression as well using regression trees
XGBoost input format
CSV or Libsvm
no Protobuf here
XGBoost, how is it used?
models are serialized/deserialized with pickle
can use a framework within notebooks
- sagemaker.xgboost
or as a built-in algorithm
XGBoost Hyperparameters
Subsample
- Prevent overfitting
Eta
- Step size shrinkage, prevents overfitting
Gamma
- Minimum loss reduction to create a partition
- larger value = more conservative
Alpha
- L1 regularization term
- Larger value= more conservative
Lambda
- L2 regularization term
- Larger = more conservative
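A hedged sketch of training the built-in XGBoost algorithm with these hyperparameters, assuming the SageMaker Python SDK v2; the bucket paths, role ARN, and container version string are placeholders:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"   # hypothetical role ARN

# Resolve the built-in XGBoost container for this region (version string may differ)
image_uri = image_uris.retrieve(framework="xgboost", region=session.boto_region_name, version="1.5-1")

xgb = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",                   # memory-bound, CPU instance per the notes
    output_path="s3://my-bucket/xgboost/output",    # hypothetical bucket
    sagemaker_session=session,
)

# Hyperparameters from the flashcards: eta, gamma, subsample, alpha, lambda
xgb.set_hyperparameters(
    objective="binary:logistic",
    num_round=100,
    eta=0.2,            # step size shrinkage
    gamma=4,            # minimum loss reduction to create a partition
    subsample=0.8,      # row subsampling to prevent overfitting
    alpha=0.1,          # L1 regularization term
    **{"lambda": 1.0},  # L2 regularization term ("lambda" is a Python keyword)
)

xgb.fit({
    "train": TrainingInput("s3://my-bucket/xgboost/train.csv", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/xgboost/validation.csv", content_type="text/csv"),
})
```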
XGBoost instance types
CPU only
it is memory-bound, not compute-bound
M4 is a good choice
Seq2Seq use cases
input sequence of tokens
output sequence of tokens
Machine Translation
Text Summarization
Speech to text
Implemented with RNNs and CNNs with attention
Seq2Seq training input
RecordIO-Protobuf
- Tokens must be integers (unlike most other algorithms, which want floating-point data)
Start with Tokenized text files
Convert to Protobuf using sample code
- Packs into integer tensors with vocabulary files
- a lot like the TF-IDF lab
Must provide:
- Training Data
- Validation Data
- Vocabulary files
Is there any pre-trained model for SageMaker?
Yes, there are many
also, public training datasets are available for specific translation tasks
Seq2Seq Hyperparameters
Batch_size
Optimizer_type:
- adam
- sgd
- rmsprop
Learning_rate
Num_layers_encoder/decoder
Can optimize on:
- Accuracy: vs. provided validation dataset
- BLEU score: compares against multiple reference translations
- Perplexity: cross-entropy
Seq2Seq instance type
only GPU (e.g. P3)
only one machine but it can come with multiple GPUs
DeepAR
Forecasting one-dimensional time series data
uses RNN’s
Allows training the same model over several related time series
Finds frequencies and seasonality
DeepAR training input
JSON lines format
- GZIP or Parquet
Each record must contain:
- Start timestamp
- Target
Each record can contain:
- Dynamic_feat: dynamic features (such as, was a promotion applied to a product in a time series)
- Cat: Categorical features
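A small Python sketch of what the JSON Lines training records might look like; the values, categories, and the promotion feature are made-up examples:

```python
import json

# Hypothetical daily series: required "start" and "target", optional "cat" and "dynamic_feat"
records = [
    {
        "start": "2024-01-01 00:00:00",
        "target": [12.0, 15.0, 14.0, 20.0, 18.0],   # the time series values
        "cat": [0],                                 # e.g. product category
        "dynamic_feat": [[0, 0, 1, 1, 0]],          # e.g. was a promotion applied on each day
    },
    {
        "start": "2024-01-01 00:00:00",
        "target": [3.0, 4.0, 4.0, 6.0, 5.0],
        "cat": [1],
        "dynamic_feat": [[0, 1, 0, 0, 0]],
    },
]

with open("train.json", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")   # one JSON object per line (JSON Lines)
```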
DeepAR
Which parts of the data should be used as input?
always include the entire time series for training, testing, and inference
use the entire dataset as the test set, but remove the last time points for training; evaluate on those withheld values
don't use very large values for prediction length (> 400)
train on many related time series, not just one, when possible
maximum recommended prediction length in DeepAR
400 data points in future
DeepAR Hyperparameters
- context_length: the number of time points the model sees before making a prediction; it can be smaller than the seasonalities, since the model will lag one year anyway
- epochs
- mini_batch_size
- learning_rate
- num_cells
DeepAR instance types
CPU / GPU
Single / Multi machine
most economical: start with CPU (ml.c4.2xlarge / ml.c4.4xlarge)
move to GPU if necessary
CPU-only for inference
may need larger instances for tuning
BlazingText
good for?
Text classification:
- predicts labels for sentences (not whole documents)
- supervised
- useful for web searches, information retrieval
Word2vec:
- creates word embeddings
- useful for NLP, but is not an NLP algorithm in itself
- machine translation, sentiment analysis
- works on individual words, not sentences or documents
BlazingText input formats
Text classification:
- one sentence per line
- __label__{label} {sentence}
- an augmented manifest text format is also possible: {"source":"", "label":""}
Word2vec:
- a text file with one training sentence per line
(see the sketch below)
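A quick sketch of writing both supervised input formats; the labels and sentences are made-up examples, and the numeric label value in the manifest is an assumption:

```python
import json

# Text classification: "__label__<label>" followed by the (tokenized, lowercased) sentence
with open("train.txt", "w") as f:
    f.write("__label__positive this movie was surprisingly good\n")
    f.write("__label__negative the plot made no sense at all\n")

# Augmented manifest alternative: one JSON object per line with "source" and "label" attributes
with open("train.manifest", "w") as f:
    f.write(json.dumps({"source": "this movie was surprisingly good", "label": 1}) + "\n")
```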
word2vec in BlazingText: what modes of operation are available?
- Cbow (continuous bag of words): the order of the words is thrown out; just the words themselves matter
- Skip-gram: uses n-grams, so the order of the words does matter
- Batch skip-gram: distributes the computation over many CPU nodes
BlazingText hyperparameters:
- Word2vec: mode (batch_skipgram, skipgram, cbow), learning_rate, window_size, vector_dim, negative_samples
- Text classification: epochs, learning_rate, word_ngrams, vector_dim
BlazingText instance types
- cbow, skipgram: any single CPU or single GPU instance works (ml.p3.2xlarge recommended)
- batch_skipgram: single or multiple CPU instances (scale horizontally)
- text classification: C5 if training data is < 2GB; for larger datasets, use a single GPU instance (ml.p2.xlarge or ml.p3.2xlarge)
Object2vec
Like word2vec but for arbitrary objects
low dimensional dense embedding of high-dimensional objects
compute nearest neighbors of objects
Visualize clusters
Genre predictions
Recommendations (similar items to users)
unsupervised
Is Object2Vec unsupervised or supervised?
unsupervised
so you don't need labeled training data; it can figure out similarities automatically based on the inherent structure of the data and its features
Object2vec inputs
data must be tokenized into integers
consists of pairs of tokens and/or sequences of tokens
- sentence - sentence
- label - sequence (e.g. genre to description)
- product - product
- user - item
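A hedged sketch of what the JSON Lines training records look like, with made-up token IDs; the in0/in1/label field names follow the Object2Vec input format:

```python
import json

# Each record pairs two tokenized inputs ("in0", "in1") with an optional label/score
pairs = [
    {"label": 1, "in0": [6, 17, 606, 19], "in1": [16, 21, 13, 45]},   # e.g. a similar pair
    {"label": 0, "in0": [22, 1016, 32],   "in1": [22, 32, 13, 25]},   # e.g. a dissimilar pair
]

with open("train.jsonl", "w") as f:
    for rec in pairs:
        f.write(json.dumps(rec) + "\n")   # one JSON object per line, shuffled before training
```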
object2vec encoder choices?
Average-pooled embeddings
CNN
Bidirectional LSTM
object2vec, how to use?
process data into JSON lines and shuffle it
train with two input channels, two encoders and a comparator
choose an encoder
Comparator is followed by a feed-forward neural network
object2vec hyperparameters
dropout, early stopping, epochs, learning rate, batch size, layers, activation function, optimizer, weight decay
Encoders: (hcnn, bilstm, pooled_embedding)
enc1_network
enc2_network
object2vec instance type
Single machine CPU / GPU
Multi GPU is ok
ml.m5.2xlarge
ml.m5.4xlarge
ml.m5.12xlarge
ml.p2.xlarge
object2vec inference instance type
use ml.p3.2xlarge
set the INFERENCE_PREFERRED_MODE environment variable to optimize for encoder embeddings rather than classification or regression