Modeling Flashcards
Types of Neural Networks
Feedforward
Convolutional Neural Network
Recurrent Neural Network
Convolutional Neural Network (CNN)
Image Classification
Recurrent Neural Network
for sequences
e.g. Stock Prices, Words in a sentence…
- LSTM, GRU
LSTM full form
Long Short-Term Memory
GRU full form
Gated Recurrent Unit
What if the problem is feature-location invariant?
e.g. we're not sure where the sign is in our image; use a CNN
adversarial example
An adversarial example is an instance with small, intentional feature perturbations that cause a machine learning model to make a false prediction.
another CNN example: Sentiment Analysis
MaxPooling1D
MaxPooling2D
MaxPooling3D
distill the input down to the bare essence of what you need to analyse
Conv1D
Conv2D
Conv3D
these layer types do the actual convolution
1D, e.g. text
2D, e.g. images
3D, e.g. 3D volumetric data
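A minimal sketch (assuming Keras with TensorFlow, which these notes don't name; shapes are made up) of how the Conv/MaxPooling layer dimensionality matches the data:

```python
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Conv2D, MaxPooling2D

# 1D: e.g. text as a sequence of 100 embedding vectors of size 64
conv_text = Conv1D(filters=32, kernel_size=3, activation='relu',
                   input_shape=(100, 64))
pool_text = MaxPooling1D(pool_size=2)

# 2D: e.g. 28x28 grayscale images (1 colour channel)
conv_image = Conv2D(filters=32, kernel_size=(3, 3), activation='relu',
                    input_shape=(28, 28, 1))
pool_image = MaxPooling2D(pool_size=(2, 2))
```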
Typical image processing with a CNN. What's the process?
Conv2D:
- does the convolution
MaxPooling2D:
- distill down and shrink image
Dropout:
- Prevents overfitting
Flatten:
- flatten the data to feed it into a perceptron
Dense:
- hidden layer of neurons (a perceptron)
Dropout:
- prevents overfitting (again)
Softmax:
- chooses the final classification that comes out of the neural network
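A minimal sketch of that pipeline, assuming Keras (layer sizes and the 10-class output are made-up placeholders):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # does the convolution
    MaxPooling2D(pool_size=(2, 2)),   # distills down and shrinks the image
    Dropout(0.25),                    # prevents overfitting
    Flatten(),                        # flattens the data for the dense layers
    Dense(128, activation='relu'),    # hidden layer of neurons (perceptron)
    Dropout(0.5),                     # prevents overfitting again
    Dense(10, activation='softmax'),  # chooses the final classification (10 classes assumed)
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
```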
Name some Specialised Architectures of CNN
LeNet-5:
- handwriting recognition
AlexNet:
- Image Classification, Deeper than LeNet
GoogLeNet:
- even deeper than AlexNet, but with better performance
- uses Inception Modules
ResNet:
- Residual Network, even deeper but maintains performance using Skip Connections
Recurrent Neural Network (RNN) topologies
Where sequence matters
- Sequence to Sequence
- Sequence to Vector
- Vector to sequence
- Encoder -> Decoder
Sequence to Sequence NN
input time-series, output time-series
e.g. predicting future stock prices from a history of stock prices
Sequence to Vector NN
e.g. Words in a sentence to sentiments
Vector to Sequence NN
e.g. produce a caption from an image
Encoder -> Decoder
Sequence to Vector to Sequence
e.g. encode the words of a French sentence into a vector, then decode that vector into an English translation
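A minimal sequence-to-vector sketch, assuming Keras (vocabulary size, layer sizes, and the binary sentiment output are assumptions): words in a sentence go in, a single sentiment prediction comes out.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=20000, output_dim=64),  # word IDs -> word vectors
    LSTM(64),                                   # sequence -> single vector (swap in GRU(64) for the simplified cell)
    Dense(1, activation='sigmoid'),             # sentiment: positive vs negative
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```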
Training RNN
Backpropagation happens both through the neural network and through time
Really hard
Sensitive to hyperparameters
Resource intensive
LSTM
maintains both long-term and short-term states
GRU
Gated Recurrent Unit
Simplified LSTM
What if you make the wrong choices when training an RNN?
it might lead to an RNN that doesn't converge at all
What does AWS offer for training a neural network?
Apache MXNet on EMR
P2, P3, G Instance types
Deep Learning AMI
Major Components of Tuning a Neural Network? (hyperparameters)
Some knobs and dials:
- Learning Rate
- Batch size
- epochs
Learning Rate
used by Gradient Descent (or other optimization methods) to control how big each training step is
Too high LR:
- overshoot the optimal solution
Too Low LR:
- take too long to find the optimal solution
Batch Size
Small batch sizes can work their way out of local minima more easily
Large Batch sizes can end up getting stuck in the wrong solution
With random shuffling at each epoch, this can show up as very inconsistent results from run to run
Learning Rate and Training
Small LR will increase the training time
Large LR can overshoot the correct solution
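A minimal sketch, assuming Keras, of where these knobs and dials live: the learning rate goes on the optimizer, batch size and epochs on fit(). `model`, `X_train`, and `y_train` are assumed placeholders.

```python
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001),  # too high: overshoot; too low: slow training
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train,
          batch_size=32,          # smaller batches can escape local minima more easily
          epochs=10,              # number of passes over the training data
          validation_split=0.2)
```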
Regularization Techniques. What do they do?
they prevent overfitting
If you are overfitting, what can you try?
try simpler model
try fewer neurons
try fewer layers
Dropout:
- remove some neurons at random at each training step to force the model to spread its learning out
Early Stopping:
- stop at the point where training accuracy keeps improving but validation accuracy does not
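A minimal sketch of both techniques, assuming Keras (`model`, `X_train`, `y_train` are assumed placeholders): Dropout as a layer, early stopping as a callback watching validation loss.

```python
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))  # randomly drop 50% of this layer's neurons at each training step

early_stop = EarlyStopping(monitor='val_loss', patience=3,
                           restore_best_weights=True)  # stop once validation stops improving
model.fit(X_train, y_train, validation_split=0.2, epochs=50,
          callbacks=[early_stop])
```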
Vanishing Gradient Problem
Opposite of Exploding Gradients
Vanishing Gradient is when the slope of the learning curve approaches zero
Addressing Vanishing Gradient Problem
Multi-level hierarchy
- train sub-networks instead of the whole network
LSTM
Residual Network
- ResNet, for object recognition
Better choices of Activation Function
- ReLU
Gradient Checking
a debugging technique
Numerically check the derivatives computed during training
Useful for validating the code of a neural network implementation
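A minimal gradient-checking sketch in plain NumPy (the toy loss and values are made up for illustration): compare an analytic derivative to a centered finite difference.

```python
import numpy as np

def loss(w):
    return (w ** 2).sum()      # toy loss: L(w) = sum of w_i^2

def analytic_grad(w):
    return 2 * w               # its known derivative: dL/dw_i = 2 * w_i

w = np.array([0.5, -1.3, 2.0])
eps = 1e-6
numeric_grad = np.array([
    (loss(w + eps * np.eye(3)[i]) - loss(w - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
# If this difference is tiny, the analytic gradient code is probably correct
print(np.max(np.abs(numeric_grad - analytic_grad(w))))
```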
L1 and L2 Regularization
L1 penalty is the sum of the absolute values of the weights
L2 penalty is the sum of the squares of the weights
both are used to prevent overfitting
L1 and L2 differences?
L1: sum of the absolute values of the weights
- performs feature selection
- Computationally inefficient
- sparse output
L2: sum of the squares of the weights
- all features remain considered, just weighted
- computationally efficient
- Dense output
Why choose L1 over L2, then?
Feature selection reduces the dimensionality
- out of 100 features, maybe only 10 end up with non-zero coefficients
- the resulting sparsity can make up for L1's computational inefficiency
on the other hand, if you think all the features are important, go for L2
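A minimal sketch, assuming Keras, of applying L1 or L2 regularization to a layer's weights (layer sizes and the 0.01 strength are assumptions):

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras import regularizers

# L1: drives many weights to exactly zero (feature selection, sparse output)
dense_l1 = Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l1(0.01))

# L2: keeps all weights small but non-zero (all features considered, dense output)
dense_l2 = Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01))
```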
Confusion Matrix
True Positives / True Negatives
False Positives / False Negatives
Predicted Yes, Actual Yes
- True Positive
Predicted Yes, Actual No
- False Positive
Predicted No, Actual Yes
- False Negative
Predicted No, Actual No
- True Negative
Multi-class confusion matrix
a confusion matrix, often shown as a heat map, is also useful for multi-class classification
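A minimal sketch, assuming scikit-learn, seaborn, and matplotlib, of building a multi-class confusion matrix and plotting it as a heat map (the labels are made-up placeholders):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]

cm = confusion_matrix(y_true, y_pred)   # rows = actual class, columns = predicted class
sns.heatmap(cm, annot=True, fmt='d')    # heat map makes multi-class results easy to scan
plt.show()
```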
Precision
TP / (TP + FP)
of everything nominated (predicted positive), how many were actually relevant
AKA
- Percent of relevant results
- Correct Positives
a good metric when false positives matter
e.g. Medical screening, drug testing
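A minimal sketch of computing precision both by the formula and with scikit-learn (an assumed library here; the labels are made up for illustration):

```python
from sklearn.metrics import precision_score

y_true = [1, 1, 0, 1, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 3 true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1 false positive
print(tp / (tp + fp))                    # 0.75
print(precision_score(y_true, y_pred))   # 0.75, same result
```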