Quiz 2 - Optimization, CNN Flashcards
Loss surface geometries difficult for optimization
- local minima
- plateaus
- saddle points (see the sketch below)
  - gradients in orthogonal directions are zero
  - (min along one direction, max along another)
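A toy saddle example (not from the card, purely illustrative): f(x, y) = x**2 - y**2 has zero gradient at the origin, a minimum along x, and a maximum along y.

```python
import numpy as np

# Toy saddle: f(x, y) = x^2 - y^2 (hypothetical example)
f = lambda x, y: x**2 - y**2
grad = lambda x, y: np.array([2 * x, -2 * y])  # analytic gradient

print(grad(0.0, 0.0))          # [ 0. -0.] -> gradient is zero at the saddle
print(f(0.1, 0.0) > f(0, 0))   # True: minimum along the x direction
print(f(0.0, 0.1) < f(0, 0))   # True: maximum along the y direction
```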
tanh function
- min: -1, max: 1
- zero-centered
- saturates at both ends
- gradients
- vanishes at both ends
- computationally heavy
parameter sharing
regularize parameters to be close together by forcing sets of parameters to be equal
Normalization is helpful as it can
- improve gradient flow
- improve learning
Color jitter
- Used for data augmentation
- add/subtract small (or large) values to the RGB channels of an image (included in the augmentation sketch below)
Data Augmentation
- Perform a range of transformations to a dataset
- increases data for free
- should not change meaning of data
- ex: flip image, black/white (grayscale), crop (see the sketch below)
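A minimal sketch of such an augmentation pipeline, assuming torchvision is available and roughly 32x32 RGB inputs; the specific transform parameters are illustrative, not from the cards.

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline (parameter values are arbitrary choices)
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),               # flip image
    T.RandomCrop(32, padding=4),                 # crop
    T.ColorJitter(brightness=0.2, contrast=0.2,
                  saturation=0.2, hue=0.05),     # color jitter
    T.RandomGrayscale(p=0.1),                    # black/white
    T.ToTensor(),
])

# augmented = augment(pil_image)  # apply to a PIL image during training
```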
The key principle for NN training
- Monitor everything to understand what is going on
- loss/accuracy curves
- gradient statistics/characteristics
- other aspects of computation graph
Sanity checks for learning after optimization
- Check bounds of loss function
- Check initial loss at small random weight values (≈ -log(1/C) for cross-entropy with C classes; see the sketch after this list)
- start without regularization; adding it back should increase the loss
- simplify dataset to make sure model can properly (over)fit before applying regularization
- to ensure that model capacity is enough
- model should be able to memorize
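A minimal sketch of the initial-loss check, assuming a C-class softmax classifier: with near-zero weights the predicted distribution is roughly uniform, so the cross-entropy loss should start near -log(1/C).

```python
import numpy as np

C = 10  # e.g., 10 classes
expected_initial_loss = -np.log(1.0 / C)   # ≈ 2.303 for C = 10
print(expected_initial_loss)

# In practice: run one forward pass on a batch with small random weights
# and compare the measured cross-entropy loss against this value.
```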
L2 regularization results in a solution that is ___ sparse than L1
L2 regularization results in a solution that is less sparse than L1
Why is initialization of parameters important
- determines how statistics of outputs (given inputs) behave
- determines if the gradients vanish at the beginning (dampening learning)
- ie. gradient flow
- allows bias at the start (linear)
- faster convergence
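A minimal sketch of the "statistics of outputs" point, using a hypothetical 10-layer tanh MLP with weights drawn too small: the activation standard deviation collapses toward zero layer by layer, so gradients vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 256))              # a batch of fake inputs

h = x
for layer in range(10):
    W = rng.standard_normal((256, 256)) * 0.01   # too-small init (illustrative)
    h = np.tanh(h @ W)
    print(f"layer {layer}: activation std = {h.std():.5f}")  # shrinks toward 0
```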
What suggests overfitting when looking at validation/training curve?
- validation loss/accuracy starts to get worse after a while

Shared Weights
- Advantage
- reduce params
- explicitly maintain spatial information
- Use the same weights/params in the computation graph (see the sketch below)
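A minimal sketch of weight sharing, assuming PyTorch: the same weight tensor appears twice in the computation graph, so both uses contribute gradients to a single parameter.

```python
import torch

W = torch.randn(4, 4, requires_grad=True)   # one shared parameter
x1, x2 = torch.randn(4), torch.randn(4)

# Same W used at two places in the computation graph
out = (W @ x1).sum() + (W @ x2).sum()
out.backward()

print(W.grad.shape)   # torch.Size([4, 4]): one gradient, summed over both uses
```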
sigmoid function
- Gradient will be vanishingly small
- Partial derivative of loss wrt weights (used for gradient descent) will be a very small number (multiplied by a small upstream gradient)
- pass back the small gradients
- Forward pass of high values
  - causes larger and larger forward values
- Issues in both directions (forward and backward passes)
- computationally heavy
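A minimal sketch of the vanishing-gradient point: the sigmoid's local gradient σ'(x) = σ(x)(1 - σ(x)) is at most 0.25, so chaining many saturating sigmoids multiplies many small numbers.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))

print(dsigmoid(0.0))    # 0.25 -> the maximum possible local gradient
print(dsigmoid(5.0))    # ≈ 0.0066 -> saturated region, near-zero gradient

# Upstream gradient after passing back through 10 saturating sigmoid layers:
print(0.25 ** 10)       # ≈ 9.5e-07
```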
ReLU
- min: 0, max: infinity
- outputs always positive
- no saturation on the positive end
- better gradient flow
- gradients
  - 0 if x <= 0 (dead ReLU)
    - other ReLU units can make up for this
- computationally cheap
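A minimal sketch of the ReLU gradient: it is 0 for x <= 0 and 1 otherwise, so a unit whose inputs are always negative receives no gradient (a dead ReLU).

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)
drelu = lambda x: (x > 0).astype(float)   # local gradient of ReLU

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))    # [0.  0.  0.  0.5 2. ]
print(drelu(x))   # [0. 0. 0. 1. 1.] -> zero gradient wherever x <= 0
```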
Sigmoid is typically avoided unless ___
you want to clamp values to [0, 1] (i.e., logistic regression)
Simpler Xavier initialization (Xavier2)
N(0, 1) * sqrt(1 / n_j), where n_j is the fan-in (number of inputs to the layer)
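A minimal sketch of this initialization in NumPy, taking n_j as the fan-in of the layer:

```python
import numpy as np

def xavier_init(n_in, n_out, rng=np.random.default_rng()):
    """Simpler Xavier: N(0, 1) scaled by sqrt(1 / fan_in)."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(1.0 / n_in)

W = xavier_init(256, 128)
print(W.std())   # ≈ sqrt(1/256) = 0.0625
```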
How to Prevent Co-Adapted Features
- Dropout Regularization
- Keep nodes with probability p
- dropped nodes (probability 1 - p) have their activation set to 0
- Choose nodes to mask out at each iteration
- multiply the activations by a binary {0, 1} mask
- Note: no nodes are dropped during testing
- Scale Weights at test time by p so that input/outputs have similar distributions
- At test time, all nodes are active, so we need a way to account for this (see the sketch below)
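A minimal sketch of the train/test behavior above (standard dropout: drop at train time, scale by p at test time; the helper names are mine, not a library API).

```python
import numpy as np

def dropout_train(h, p, rng=np.random.default_rng()):
    """Keep each activation with probability p; dropped units are set to 0."""
    mask = (rng.random(h.shape) < p).astype(float)   # binary {0, 1} mask
    return h * mask

def dropout_test(h, p):
    """No units are dropped at test time; scale by p to match expected values."""
    return h * p

h = np.ones(10)
print(dropout_train(h, p=0.5))   # roughly half the units zeroed
print(dropout_test(h, p=0.5))    # all units active, scaled by 0.5
```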
Fully Connected Neural Network
learns more and more abstract features from the raw input
not well-suited for images
Why does dropout work
- Model should not rely too heavily on a particular feature
- Probability (1 - p) of losing the feature it relies on
- Equalizes the weights across all of the features
- Training 2^n neural networks
  - n = number of nodes
  - 2^n distinct variations of the mask
- ensemble effect
Pooling Layer
Layer to explicitly down-sample image or feature maps (dimensionality reduction)
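A minimal sketch of 2x2 max pooling with stride 2 on a single (H, W) feature map (a hypothetical helper, not a library call):

```python
import numpy as np

def max_pool_2x2(x):
    """Down-sample an (H, W) feature map by taking the max over 2x2 blocks."""
    H, W = x.shape
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4).astype(float)
print(max_pool_2x2(x))   # [[ 5.  7.] [13. 15.]] -> max of each 2x2 block
```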
What is the number of parameters for a CNN layer with N kernels of size k1 × k2 × … × kn and 3 input channels?
N * (k1 * k2 * … * kn * 3 + 1), i.e., (kernel size × 3 channels + 1 bias) per kernel
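A minimal sketch checking the formula for the 2-D case (n = 2) against PyTorch's Conv2d:

```python
import torch.nn as nn

N, k1, k2 = 8, 5, 5
conv = nn.Conv2d(in_channels=3, out_channels=N, kernel_size=(k1, k2))

n_params = sum(p.numel() for p in conv.parameters())
print(n_params)                   # 608
print(N * (k1 * k2 * 3 + 1))      # 608 -> weights + one bias per kernel
```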
L2 regularization
- L2 norm
- encourages small weights (but fewer zeros than L1)
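A minimal sketch of an L2 penalty and its gradient (lam is the regularization strength; the gradient term 2 * lam * w shrinks weights toward zero rather than zeroing them out as L1 tends to):

```python
import numpy as np

def l2_penalty(W, lam):
    """L2 regularization term and its gradient w.r.t. the weights."""
    return lam * np.sum(W ** 2), 2 * lam * W

W = np.array([0.5, -1.0, 2.0])
penalty, grad = l2_penalty(W, lam=0.01)
print(penalty)   # 0.0525
print(grad)      # [ 0.01 -0.02  0.04] -> pushes weights toward (not to) zero
```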
Sigmoid Function Key facts
- min: 0, max: 1
- outputs are always positive
- saturates at both ends
- gradient
- vanishes at both ends
- always positive
Definition of accuracy with respect to TP, TN, FP, FN
(TP + TN) / (TP + TN + FP + FN)
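A quick check with made-up counts:

```python
TP, TN, FP, FN = 40, 45, 5, 10
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)   # 0.85
```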







