Quiz 2 - Optimization, CNN Flashcards
Loss surface geometries difficult for optimization
- local minima
- plateaus
- saddle points
- gradients of orthogonal directions are zero
- (min for one direction, max for another)
tanh function
- min: -1, max: 1
- zero-centered
- saturates at both ends
- gradients
- vanishes at both ends
- computationally heavy
parameter sharing
regularize parameters to be close together by forcing sets of parameters to be equal
Normalization is helpful as it can
- improve gradient flow
- improve learning
Color jitter
- Used for data augmentation
- add/subtract a small or large value to/from the RGB channels of an image
Data Augmentation
- Perform a range of transformations to a dataset
- increases data for free
- should not change meaning of data
- ex: flip image, black/white, crop
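A minimal sketch of such a pipeline, assuming torchvision is available; the specific transforms and magnitudes below are illustrative choices, not values from the cards:

```python
# Hypothetical augmentation pipeline covering the examples above
# (flip, black/white, crop, color jitter). Parameter values are illustrative.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),            # flip image
    transforms.RandomGrayscale(p=0.1),                  # black/white
    transforms.RandomCrop(28, padding=4),               # crop
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.05),   # color jitter
    transforms.ToTensor(),
])
```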
The key principle for NN training
- Monitor everything to understand what is going on
- loss/accuracy curves
- gradient statistics/characteristics
- other aspects of computation graph
Sanity checks for learning after optimization
- Check bounds of loss function
- Check initial loss at small random weight values (≈ -log(1/C) for cross-entropy over C classes)
- start w/o regularization first; adding it back should increase the loss
- simplify dataset to make sure model can properly (over)fit before applying regularization
- to ensure that model capacity is enough
- model should be able to memorize
L2 regularization results in a solution that is ___ sparse than L1
L2 regularization results in a solution that is less sparse than L1
Why is initialization of parameters important
- determines how statistics of outputs (given inputs) behave
- determines if the gradients vanish at the beginning (dampening learning)
- ie. gradient flow
- allows bias at the start (linear)
- faster convergence
What suggests overfitting when looking at validation/training curve?
- validation loss/accuracy starts to get worse after a while
Shared Weights
- Advantage
- reduce params
- explicitly maintain spatial information
- Use same weights/params in computation graph
sigmoid function
- Gradient will be vanishingly small
- Partial derivative of loss wrt weights (used for gradient descent) will be a very small number (multiplied by a small upstream gradient)
- pass back the small gradients
- Forward pass: high values
- cause larger and larger forward values
- issues in both the forward and backward passes
- computationally heavy
ReLU
- min: 0, max: infinity
- outputs always positive
- no saturation on the positive end
- better gradient flow
- gradients
- 0 if x <= 0 (dead ReLU)
- other ReLUs can make up for this
- computationally cheap
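A small numpy check (not from the cards) comparing the local gradients of sigmoid, tanh, and ReLU, illustrating the saturation and dead-ReLU points above:

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

sig = 1.0 / (1.0 + np.exp(-x))
d_sig = sig * (1.0 - sig)          # sigmoid gradient: vanishes at both ends
d_tanh = 1.0 - np.tanh(x) ** 2     # tanh gradient: vanishes at both ends
d_relu = (x > 0).astype(float)     # ReLU gradient: 0 if x <= 0, else 1

print(d_sig)   # [~4.5e-05, 0.197, 0.25, 0.197, ~4.5e-05]
print(d_tanh)  # [~8.2e-09, 0.420, 1.0,  0.420, ~8.2e-09]
print(d_relu)  # [0., 0., 0., 1., 1.]
```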
Sigmoid is typically avoided unless ___
you want to clamp values from [0,1] (ie. logistic regression)
Simpler Xavier initialization (Xavier2)
N(0, 1) * sqrt(1 / n_j), where n_j = number of inputs to the layer (fan-in)
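A minimal numpy sketch of this initialization, assuming n_j is the layer's fan-in; the layer sizes are made up for illustration:

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Simplified Xavier: N(0, 1) scaled by sqrt(1 / fan_in) so that output
    # variance roughly matches input variance at the start of training.
    return np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)

W = xavier_init(784, 256)   # e.g., a 784 -> 256 fully connected layer
print(W.std())              # ~ sqrt(1/784) ≈ 0.036
```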
How to Prevent Co-Adapted Features
- Dropout Regularization
- Keep each node with probability p
- dropped nodes get their activation set to 0
- Choose nodes to mask out at each iteration
- multiply the nodes by a [0 1] mask
- Note: no nodes are dropped during testing
- Scale Weights at test time by p so that input/outputs have similar distributions
- At test time, all nodes are active so need a way to account for this
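A minimal numpy sketch of this dropout scheme: keep probability p during training, and scale by p at test time (scaling the activations here, which is equivalent to scaling the weights):

```python
import numpy as np

def dropout_forward(x, p, train=True):
    # x: activations; p: probability of KEEPING a node
    if train:
        mask = (np.random.rand(*x.shape) < p).astype(x.dtype)  # [0, 1] mask
        return x * mask        # dropped nodes -> 0 activation
    # test time: all nodes active, so scale by p to match the training-time
    # expected output
    return x * p

h = np.random.randn(4, 8)
h_train = dropout_forward(h, p=0.5, train=True)
h_test = dropout_forward(h, p=0.5, train=False)
```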
Fully Connected Neural Network
- more and more abstract features from raw input
- not well-suited for images
Why does dropout work
- Model should not rely too heavily on a particular feature
- Probability (1 - p) of losing the feature it relies on
- Equalizes the weights across all of the features
- Effectively training 2^n neural networks
- n = number of nodes
- 2^n distinct variations of the mask
- ensemble effect
Pooling Layer
Layer to explicitly down-sample image or feature maps (dimensionality reduction)
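A minimal numpy sketch of such down-sampling (2x2 max pooling with stride 2 on a single-channel feature map; the window size is an illustrative choice):

```python
import numpy as np

def max_pool_2x2(x):
    # x: (H, W) feature map with H, W divisible by 2
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))   # 2x2 output: [[ 5.  7.], [13. 15.]]
```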
What is the number of parameters for a CNN layer with N kernels of size k1 x k2 and 3 input channels?
N * (k1 * k2 * 3 + 1), where the +1 is each kernel's bias term
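For example, 64 kernels of size 3 x 3 over a 3-channel input give 64 * (3 * 3 * 3 + 1) = 1,792 parameters.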
L2 regularization
- L2 norm
- encourages small weights (but fewer zeros than L1)
Sigmoid Function Key facts
- min: 0, max: 1
- outputs are always positive
- saturates at both ends
- gradient
- vanishes at both ends
- always positive
Definition of accuracy with respect to TP, TN, FP, FN
(TP + TN) / (TP + TN + FP + FN)
Per-Parameter Learning Rate
- Dynamic learning rate for each weight
- Examples
- RMSProp
- Adagrad
- Adam
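A minimal numpy sketch of one such optimizer (RMSProp); the hyperparameter values are typical defaults, not from the cards:

```python
import numpy as np

def rmsprop_update(w, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    # cache: running average of squared gradients, one entry per parameter
    cache = decay * cache + (1 - decay) * grad ** 2
    # dividing by sqrt(cache) gives each weight its own effective learning rate
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```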
How can you mitigate the initialization bias in Adam's moment estimates
- Time-varying bias correction
- beta1 = 0.9
- beta2 = 0.999
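A sketch of Adam's update with this time-varying bias correction (t is the iteration count starting at 1; default-style hyperparameters assumed):

```python
import numpy as np

def adam_update(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # time-varying bias correction
    v_hat = v / (1 - beta2 ** t)                # (corrects zero-initialized m, v)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```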
Difference between Convolution and Cross-Correlation
- Convolution: starts at the end of the kernel and moves backward (i.e., flips the kernel)
- Cross-correlation: starts at the beginning of the kernel and moves forward (same ordering as the image)
- equivalent to convolution with an already-flipped kernel
- both are a dot product sliding along the image
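A 1-D numpy check of this relationship, showing that convolution equals cross-correlation with a pre-flipped kernel:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([1.0, 0.0, -1.0])

conv = np.convolve(x, k, mode='valid')         # flips the kernel before sliding
xcorr = np.correlate(x, k, mode='valid')       # slides the kernel as-is
xcorr_flipped = np.correlate(x, k[::-1], mode='valid')

print(conv)           # [ 2.  2.  2.]
print(xcorr)          # [-2. -2. -2.]
print(xcorr_flipped)  # [ 2.  2.  2.]  == conv: convolution is cross-correlation
                      #                   with an already-flipped kernel
```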
T/F: The existence of local minima is the main issue in optimization
- False - Other aspects of the loss surface cause issues
- Noisy gradient estimates (ie. from mini-batches)
- Saddle points
- ill-conditioned loss surface
- curvature/gradients higher in some directions
Normalization as a layer (algorithm)
- compute the mini-batch mean and variance, normalize the inputs, then scale and shift with learned parameters (gamma, beta)
- note: small epsilon used for numerical stability
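A minimal numpy sketch of a batch-normalization forward pass, assuming that is the normalization layer the card refers to (gamma and beta are the learned scale and shift):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features)
    mu = x.mean(axis=0)                       # per-feature mini-batch mean
    var = x.var(axis=0)                       # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # eps for numerical stability
    return gamma * x_hat + beta               # learned scale and shift

x = np.random.randn(32, 4) * 10 + 5
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))      # ~0 and ~1 per feature
```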
Each node in a Convolutional NN receives ___
- Input from a K2 x K1 window (image patch)
- region of input is called “receptive field”
- Advantage
- reduce parameters
- explicitly maintain spatial information
T/F: With dropout regularization, nodes are dropped during testing
False - All nodes are kept.
What does a tiny loss change suggest?
too small of a learning rate
Which non-linearity is the most common starting point?
ReLU
T/F: In backprop and auto diff, the learning algorithm needs to be modified depending on what’s inside the computation graph
False
L1 Regularization
- L1 Norm
- encourages sparsity (drives most weights to exactly zero; only a few non-zero)
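A small sketch (not from the cards) contrasting how the two penalties enter the gradient step: L2 shrinks weights in proportion to their size, while L1's constant pull toward zero is what produces exact zeros:

```python
import numpy as np

def reg_gradient(w, lam, kind="l2"):
    # Gradient of the regularization term added to the data-loss gradient.
    if kind == "l2":
        return lam * w            # gradient of (lam/2)*||w||^2: small, not zero
    return lam * np.sign(w)       # gradient of lam*||w||_1: constant pull to 0

w = np.array([0.5, -0.01, 2.0])
print(reg_gradient(w, lam=0.1, kind="l2"))  # [ 0.05  -0.001  0.2 ]
print(reg_gradient(w, lam=0.1, kind="l1"))  # [ 0.1   -0.1    0.1 ]
```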
Convolution has the property of _____
- equivariance
- if a feature in the input is translated a little bit, the output values move by the same translation
- regardless of whether a pooling layer is involved
Method to get around loss geometries (ie. plateaus or saddle points)
- Momentum
- Decay the velocity over time for past gradients and add the current gradient
- Weight update uses the velocity (not the gradient)
- Used to “pop out” of local plateaus or saddle points
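A minimal sketch of this momentum update; beta (the velocity decay factor) and the learning rate are illustrative values:

```python
def momentum_update(w, grad, velocity, lr=0.01, beta=0.9):
    # Decay the past velocity and add the current gradient,
    # then update the weights using the velocity (not the raw gradient).
    velocity = beta * velocity - lr * grad
    w = w + velocity
    return w, velocity
```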
Nesterov Momentum
- Rather than combine velocity with current gradient, go along with velocity first and then calculate gradient at new point
- We know velocity is probably a reasonable direction
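A corresponding sketch of Nesterov momentum, where grad_fn is a hypothetical function returning the gradient at a given point:

```python
def nesterov_update(w, grad_fn, velocity, lr=0.01, beta=0.9):
    lookahead = w + beta * velocity      # go along the velocity first
    grad = grad_fn(lookahead)            # gradient at the look-ahead point
    velocity = beta * velocity - lr * grad
    w = w + velocity
    return w, velocity
```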
What is the False Positive rate (FPR)?
FP / (FP + TN)
Deep learning involves complex, compositional, non-linear functions, which cause the loss landscape to be ____
extremely non-convex
Change in loss indicates speed of ____
learning