Quiz 2 - Optimization, CNN Flashcards
Loss surface geometries difficult for optimization
- local minima
- plateaus
- saddle points
- gradients of orthogonal directions are zero
- (min for one direction, max for another)
tanh function
- min: -1, max: 1
- centered
- saturates at both ends
- gradients
- vanish at both ends
- computationally heavy
parameter sharing
regularize parameters to be close together by forcing sets of parameters to be equal
Normalization is helpful as it can
- improve gradient flow
- improve learning
Color jitter
- Used for data augmentation
- add/subtract a small or large value to RGB channels in an image
Data Augmentation
- Perform a range of transformations to a dataset
- increases data for free
- should not change meaning of data
- ex: flip image, black/white, crop
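A minimal sketch of such a pipeline using torchvision transforms (the specific classes and parameter values are illustrative, not from the card):

    import torchvision.transforms as T

    # Hypothetical augmentation pipeline; labels are unchanged since the
    # transformations should not change the meaning of the data.
    train_transform = T.Compose([
        T.RandomHorizontalFlip(p=0.5),                 # flip image
        T.RandomResizedCrop(224),                      # crop (and resize)
        T.ColorJitter(brightness=0.2, contrast=0.2,
                      saturation=0.2, hue=0.05),       # jitter RGB values
        T.RandomGrayscale(p=0.1),                      # black/white
        T.ToTensor(),
    ])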
The key principle for NN training
- Monitor everything to understand what is going on
- loss/accuracy curves
- gradient statistics/characteristics
- other aspects of computation graph
Sanity checks for learning after optimization
- Check bounds of loss function
- Check initial loss at small random weight values (-log(p) for CE)
- start without regularization, then add it and make sure the loss increases
- simplify dataset to make sure model can properly (over)fit before applying regularization
- to ensure that model capacity is enough
- model should be able to memorize
L2 regularization results in a solution that is ___ sparse than L1
L2 regularization results in a solution that is less sparse than L1
Why is initialization of parameters important
- determines how statistics of outputs (given inputs) behave
- determines if the gradients vanish at the beginning (dampening learning)
- ie. gradient flow
- allows bias at the start (linear)
- faster convergence
What suggests overfitting when looking at validation/training curve?
- validation loss/accuracy starts to get worse after a while

Shared Weights
- Advantage
- reduce params
- explicitly maintain spatial information
- Use same weights/params in computation graph
sigmoid function
- Gradient will be vanishingly small
- Partial derivative of loss wrt weights (used for gradient descent) will be a very small number (multiplied by a small upstream gradient)
- pass back the small gradients
- Forward pass high values
- causes larger and larger forward values
- Issues in both directions
- computationally heavy
ReLU
- min: 0, max: infinity
- outputs always positive
- no saturation on the positive end
- better gradient flow
- gradients
- 0 if x <= 0 (dead ReLU)
- other ReLUs can make up for this
- computationally cheap
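A quick way to check these output/gradient properties numerically, as a sketch with PyTorch autograd (the probe values are arbitrary):

    import torch

    x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0], requires_grad=True)
    for name, f in [("sigmoid", torch.sigmoid), ("tanh", torch.tanh), ("relu", torch.relu)]:
        y = f(x)
        y.sum().backward()
        # sigmoid/tanh gradients vanish at the extremes; ReLU gradient is 0 for x <= 0
        print(name, "outputs:", y.detach().numpy(), "grads:", x.grad.numpy())
        x.grad = None  # reset before the next activation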
Sigmoid is typically avoided unless ___
you want to clamp values from [0,1] (ie. logistic regression)
Simpler Xavier initialization (Xavier2)
N(0, 1) * sqrt(1 / n_j), where n_j is the fan-in
How to Prevent Co-Adapted Features
- Dropout Regularization
- Keep nodes with probability p
- nodes that are not kept have their activation set to 0
- Choose nodes to mask out at each iteration
- multiply the nodes by a [0 1] mask
- Note: no nodes are dropped during testing
- Scale Weights at test time by p so that input/outputs have similar distributions
- At test time, all nodes are active so need a way to account for this
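A minimal sketch of the scheme described above (mask during training, scale by p at test time); frameworks usually implement the equivalent "inverted" variant that scales by 1/p during training instead:

    import numpy as np

    def dropout_forward(h, p, train=True):
        # h: activations; p: probability of KEEPING a node
        if train:
            mask = (np.random.rand(*h.shape) < p).astype(h.dtype)  # new 0/1 mask each iteration
            return h * mask
        # test time: all nodes active, so scale by p to keep output statistics similar
        return h * p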
Fully Connected Neural Network
more and more abstract features from raw input
not well-suited for images
Why does dropout work
- Model should not rely too heavily on a particular feature
- Probability (1 - p) of losing the feature it relies on
- Equalizes the weights across all of the features
- Training 2^n neural networks
- n - number of nodes
- 2^n distinct variations of the mask
- ensemble effect
Pooling Layer
Layer to explicitly down-sample image or feature maps (dimensionality reduction)
What is the number of parameters for a conv layer with N kernels of size k1 x k2 and 3 input channels?
N * (k1 * k2 * 3 + 1)
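This count can be checked against PyTorch's own parameter count (a sketch; the sizes are arbitrary):

    import torch.nn as nn

    N, k1, k2 = 64, 3, 5                                   # N kernels of size k1 x k2
    conv = nn.Conv2d(in_channels=3, out_channels=N, kernel_size=(k1, k2))
    n_params = sum(p.numel() for p in conv.parameters())
    assert n_params == N * (k1 * k2 * 3 + 1)               # weights plus one bias per kernel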
L2 regularization
- L2 norm
- encourage small weights (but fewer zeros than L1)
Sigmoid Function Key facts
- min: 0, max: 1
- outputs are always positive
- saturates at both ends
- gradient
- vanishes at both ends
- always positive
Definition of accuracy with respect to TP, TN, FP, FN
(TP + TN) / (TP + TN + FP + FN)
Per-Parameter Learning Rate
- Dynamic learning rate for each weight
- Examples
- RMSProp
- Adagrad
- Adam
How can you mitigate the problem with Adam
- Time-varying bias correction
- beta1 = 0.9
- beta2 = 0.999

Difference between Convolution and Cross-Correlation
- Convolution: start at the end of the kernel and move backward
- Cross-correlation: start at the beginning of the kernel and move forward (same direction as the image)
- as if applying an already-flipped kernel
- dot product moving along the image
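A small numerical check of this relationship, as a sketch with scipy (convolution equals cross-correlation with a flipped kernel):

    import numpy as np
    from scipy.signal import convolve2d, correlate2d

    img = np.random.rand(5, 5)
    k = np.random.rand(3, 3)

    conv = convolve2d(img, k, mode="valid")
    xcorr_flipped = correlate2d(img, np.flip(k), mode="valid")  # kernel flipped in both dims
    assert np.allclose(conv, xcorr_flipped)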
T/F: The existence of local minima is the main issue in optimization
- False - Other aspects of the loss surface cause issues
- Noisy gradient estimates (ie. from mini-batches)
- Saddle points
- ill-conditioned loss surface
- curvature/gradients higher in some directions
Normalization as a layer (algorithm)
note: small epsilon used for numerical stability
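Since the algorithm itself is not written out on this card, here is a minimal sketch of the training-time batch-norm computation (per-feature statistics, with the learnable gamma/beta from the later card):

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        # x: (batch, features); normalize each feature over the mini-batch
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)  # small epsilon for numerical stability
        return gamma * x_hat + beta            # learnable scale and shift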

Each node in a NN for Convolution NN receives ___
- Input from a K2 x K1 window (image patch)
- region of input is called “receptive field”
- Advantage
- reduce parameters
- explicitly maintain spatial information
T/F: With dropout regularization, nodes are dropped during testing
False - All nodes are kept.
What does a tiny loss change suggest?
too small of a learning rate
Which non-linearity is the most common starting point?
ReLU
T/F: In backprop and auto diff, the learning algorithm needs to be modified depending on what’s inside
False
L1 Regularization
- L1 Norm
- encourages sparsity (lots of zero or near-zero weights, only a few non-zero)
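A sketch of how both penalties can be added to a training loss in PyTorch (the lambda values are illustrative; L2 is often applied via the optimizer's weight_decay instead):

    def regularized_loss(loss, params, l1_lambda=1e-5, l2_lambda=1e-4):
        # L1 pushes many weights toward exactly zero (sparsity); L2 keeps weights small
        l1 = sum(p.abs().sum() for p in params)
        l2 = sum((p ** 2).sum() for p in params)
        return loss + l1_lambda * l1 + l2_lambda * l2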
Convolution has the property of _____
- equivariance
- if feature translated a little bit, output values move by the same translation
- regardless of whether a pooling layer is involved
Method to get around loss geometries (ie. plateaus or saddle points)
- Momentum
- Decay the velocity over time for past gradients and add the current gradient
- Weight update uses the velocity (not the gradient)
- Used to “pop out” of local plateaus or saddle points
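A toy sketch of the update rule (minimizing f(w) = 0.5 * w^2; the beta and lr values are illustrative):

    # SGD with momentum: velocity is a decayed sum of past gradients
    w, velocity = 5.0, 0.0
    beta, lr = 0.9, 0.1
    for _ in range(100):
        grad = w                           # gradient of 0.5 * w^2
        velocity = beta * velocity + grad  # decay old velocity, add current gradient
        w = w - lr * velocity              # weight update uses the velocity, not the gradient
    print(w)                               # approaches 0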
Nesterov Momentum
- Rather than combine velocity with current gradient, go along with velocity first and then calculate gradient at new point
- We know velocity is probably a reasonable direction

What is the False Positive rate (FPR)?
fp / (fp + tn)
Deep learning involves complex, compositional, non-linear functions, which cause the loss landscape to be ____
extremely non-convex
Change in loss indicates speed of ____
learning
Adam
- Combining ideas of other algorithms
- Maintain both first and second moment statistics for gradients
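A toy sketch of the Adam update on the same kind of problem, including the time-varying bias correction mentioned on the related card (beta1/beta2 as listed there):

    import math

    # Adam on f(w) = 0.5 * w^2 (sketch)
    w, m, v = 5.0, 0.0, 0.0
    beta1, beta2, lr, eps = 0.9, 0.999, 0.1, 1e-8
    for t in range(1, 101):
        grad = w                                   # gradient of 0.5 * w^2
        m = beta1 * m + (1 - beta1) * grad         # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * grad ** 2    # second moment (mean of squared gradients)
        m_hat = m / (1 - beta1 ** t)               # bias correction: moments start at zero
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    print(w)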

Limitation of Linear Layers
- Images
- 1024 x 1024 = ~ a million elements (M)
- Fully connected layer (N)
- Parameters = M*N (weights) + N
- hundreds of millions of params for one layer
- More parameters => more data needed to fit
Ways to analyze non-linear functions in DL models
- min/max
- correspondence between input & output stats
- gradients
- at initialization
- at extremes
- computational complexities
Normalization methods
- subtract mean, divide by standard deviation
- (most common)
- this can be done per dimension
- whitening (ie. through PCA)
- (not common)
Combining Convolution and Pooling layers - Benefit?
- Invariance
- Pooling layer has invariance to translation of the features
- If feature is translated (moved) a bit, output values still remain the same
- pooling layer (ie max pooling) retains max values in patches as long as movement is not larger than pooling window
The velocity term is an ____ of the gradient
exponential moving average

Complexities of Batch Normalization
- During training, compute the empirical mean and variance of the mini-batch at each iteration (ie. normalizing by a different amount each time)
- causes noise in the mean/variance estimates
- During inference, use stored mean/variance (running averages) calculated on the training set
- Sufficient batch sizes must be used to get stable per-batch estimates during training.
- issue with multi-GPU or multi-machine training
- PyTorch provides synchronized batch statistics (SyncBatchNorm) to address this
Where should Batch Normalization be applied and why?
- where
- before every non-linearity
- why
- low/high values (un-normalized, imbalanced) cause saturation issues
Relationship between forward pass and gradients in CNNs
- Forward pass and gradients are opposite
- If forward = cross correlation, then gradients are convolutions (vice versa)
What is cowmask?
- Combine images with patterned masks that determine which image each pixel is taken from
- result is a complex transformation
- forces NN to be robust to occlusion (ie. object hidden behind another object in an image)
- Ground truth label is proportional to how many pixels came from each image
Learning Rate Schedules
- Hand-coded ways to schedule learning rate
- Theoretical results rely on annealed learning rate
- (learning rate with decay)
- Empirically derived learning rate
- graduate student
- stare at loss curve and determine convergence
- step-scheduler
- ex: divide learning rate by 10 every few epoch
- exponential-scheduler
- cyclical scheduler (cosine scheduler)
- alternate between a base and a max learning rate based on the step size
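A sketch of a step scheduler in PyTorch (the model and the divide-by-10-every-30-epochs schedule are placeholders):

    import torch

    model = torch.nn.Linear(10, 2)  # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(90):
        # ... train for one epoch ...
        scheduler.step()            # lr: 0.1 -> 0.01 -> 0.001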
2D Discrete Convolution
- Image
- input image
- Kernel
- applied to image
- initialize randomly and optimize
- our params (plus bias)
- Output filter / feature map
- output of image and kernel
Deeper networks are ___ sensitive to initialization
(more or less?)
- Deeper networks are more sensitive to initialization
- Activations get smaller as you move through the layers, so the standard deviation shrinks
- result: smaller update
- Larger initial values
- result: saturation
T/F: We can have a flat loss curve but increasing accuracy
True - accuracy depends only on the argmax of P(Y = yi | X = xi), so the correct class score only has to be slightly higher than the others, even if the CE loss (which uses the full probability) barely changes.
Recurrent Neural Network
Better for non-fixed size inputs (ie. sentences, phrases, etc.)
alternate architecture: transformers
Why don’t multi-kernel CNNs have kernels that learn the same filters?
The kernels are initialized to different values and so have different local minima and gradients
T/F: You can have higher training loss than validation loss
- True - Validation has no regularization term so may look better, and it is typically measured at the end of an epoch, while training loss is measured as you go and averaged across the iterations (so it is pulled up by the higher losses early in the epoch)
Why is depth important in a neural network
- modeling compositionality
- parameter efficiency
- dimensionality reduction
How to optimize to find good weights
- different optimizers have different biases
- different weight updates
- weight optimization
- regularization
- loss functions
noisy gradients
- caused by use of subset of data at each iteration to calculate the loss
- unbiased estimator with high variance
- slower convergence
Geometric transformation layers are more important for this type of DL problem
computer vision
How do you balance the standard deviation across the DL layers?
- Xavier Initialization
- Sample from a uniform distribution
- nj - fan in
- number of input nodes
- nj+1 - fan out
- number of output nodes
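A sketch of the uniform form of Xavier initialization, assuming the usual ±sqrt(6 / (n_j + n_j+1)) bounds:

    import numpy as np

    def xavier_uniform(fan_in, fan_out):
        # fan_in (n_j): number of input nodes; fan_out (n_j+1): number of output nodes
        bound = np.sqrt(6.0 / (fan_in + fan_out))
        return np.random.uniform(-bound, bound, size=(fan_in, fan_out))

    W = xavier_uniform(512, 256)  # keeps activation std roughly constant across layers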

Condition Number
- Ratio of the largest to the smallest eigenvalue
- Tells us how different the curvature is along different dimensions
- high value
- SGD will make big steps in some dims, small in others
- second-order optimization methods divide steps by curvature
- expensive to compute
What are ROC Curves?
- TPR/FPR curves represent the inherent tradeoff between number of positive predictions and correctness of predictions
- AUC is area under curve - common single-number metric to summarize
The number of channels in the output map is equal to _____
the number of kernels
What is the problem with adagrad?
- the effective learning rate goes to zero as gradients accumulate (the denominator sums squared gradients over iterations)
Relationship between loss function and other metrics
- Can be complex
- Metrics (not loss) are often not differentiable
- accuracy
- precision/recall
what is model capacity?
Number of parameters
regularization
any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error
What is the problem with Adam?
- unstable in the beginning
- one or both moments will be tiny values
Max Pooling
Stride window across the image but perform a per-patch max operation
(ie. take the max of pixels in the patch)
First step of designing the architecture
- understand data
- ask experts
- data types already have good architectures
- use what others have discovered
- understand the flow of gradients
- learning is not equal across the architecture
- could be bottlenecks
what happens when initializing to a constant value
- weights will be the same
- shared weights
- gradients will be the same
- as a result: all weights will be updated the same
Hessian
- Matrix of second-order gradients
- Gives information about the curvature of the loss surface
- Not often used in Deep Learning (computationally inefficient)
Geometric transformation
- Used for data augmentation
- translation
- rotation
- scale
- shear
Normalization can be done with learnable parameters
“Batch Normalization”
- gamma (scale)
- beta (shift)
- determine what extent to normalize
- or if not at all
What is the True Positive Rate (TPR)?
tp / (tp + fn)
Learning Many Features
- Weights are not shared across different feature extractors
- params (K1 x K2 + 1) * M
- M - number of features to learn
What suggests underfitting when looking at a validation/training curve?
- validation loss very close to training loss, or both are high
- should be able to get very low training loss in NN
leaky ReLU
- min: -infinity, max: infinity
- slightly negative slope with x <= 0
- slope can be parameterized
- no saturation
- gradients
- no dead neurons
- still cheap to compute
- subgradients
- not fully differentiable
T/F: There is one activation function best for all applications
False
How many parameters are learned in the max pooling layer?
None!
When to use cross-validation
- expensive, not done often in NN
- useful if you may not have a lot of data
What is convolution
- Mathematical operation on two functions f and g producing a third function
- third function is modified version of original functions
- gives area of overlap between initial functions
- similar to cross-correlation
Convolution hyperparameters
- in_channels
- # channels in input image
- out_channels
- # channels produced by convolution
- kernel_size
- size of convolving kernel
- stride
- stride of the convolution (default: 1)
- stride when moving across image
- padding
- zero-padding added to both sides of the input
- padding_mode
- zeros, reflect, replicate, circular
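These hyperparameters map directly onto PyTorch's nn.Conv2d (values illustrative):

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3,
                     stride=1, padding=1, padding_mode="zeros")
    x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
    y = conv(x)                     # shape (1, 16, 32, 32): one output channel per kernel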
What are the most crucial hyperparameters to tune?
- learning rate
- weight decay
Convolutional Neural Networks
Feature extractors for small local images
Better for images
Applied to sentences too
How does SGD + momentum work compared to adaptive methods?
- Typically generalizes better, but requires more tuning
Complexity of a model is only limited by _____
computation and memory
T/F: Hyperparameters cannot be independently tuned
- True - hyperparameters are interdependent
- ex - batch norm/dropout maybe not needed together
- learning rate should be changed proportionally to batch size
- gradients are more reliable/smooth
How can you mitigate the adagrad problem with gradients?
- Keep a moving average of squared gradients to avoid saturating the learning rate
- Doesn’t go to zero but decays

How do you tune hyperparameters?
- Start with coarser search
- learning rate - {0.1, 0.05, 0.03, 0.01, 0.003, 0.001, 0.0005, 0.0001}
Adagrad
- Use gradient statistics to reduce learning rate across iterations
- Accumulator takes previous accumulator plus the square of the gradient
- Weight update is learning rate / (square root of accumulator + epsilon)
- denominator sums up gradients over iterations
- directions with high curvature will have higher gradients and reduce learning rate
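A toy sketch of the accumulator and update described above, which also shows the effective learning rate decaying as squared gradients accumulate:

    import math

    # Adagrad on f(w) = 0.5 * w^2 (sketch)
    w, accumulator = 5.0, 0.0
    lr, eps = 1.0, 1e-8
    for _ in range(100):
        grad = w
        accumulator += grad ** 2                              # sum of squared gradients
        w = w - (lr / (math.sqrt(accumulator) + eps)) * grad  # effective lr shrinks over time
    print(w)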

T/F: combinations of only linear layers have the same representational powers as single layer non-linear models
False - Combination of only linear layers has the same representational power as one linear layer
Striding across an image using larger steps results in _______
loss of information and dimensionality reduction (not recommended as a dimensionality-reduction method)
Convolutional & Pooling layers
- Alternating Convolution and Pooling layers
- Convolution + Non-linear layer
- feature extraction
- Pooling
- reduce dimension of data (e.g. a 3x3 image patch reduced to 1 value)
- End with a fully connected layer to classify
Why are images less conducive to fully connected linear layers?
- images features are spatially localized
- edge/contour detectors can extract data from image patches
- gradients (dark to light on images)
- small features are repeated
- color
- motifs (corners, etc.)
- edges
- edge/contour detectors can extract data from image patches
- no reason to believe one feature tends to appear in one location of image
- ie. bird beaks are not in the center of every image
T/F: In convolutional NNs we do not need to learn location-specific features
- True
- Nodes in different locations can share features
- ie. image edges can be similar on different parts of a bird image (ie. 2 wings)
T/F: For a learned kernel, Convolution and Cross-Correlation are treated differently
- False
- Kernels are randomly initialized and learned
- Doesn’t matter if we flip the kernel or not
- no difference between convolution and cross-correlation
What does loss (and then weights) turning into NaNs suggest?
- too high of a learning rate
- divide by zero
- forget the log of the loss (causing divergence)
How do you adapt the kernel for multi-channel input images?
- Use 3-channel kernels
- Use dot product with 3x3x3 kernel
- element-wise multiplication between kernel and image patch, summing them up
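A small numpy sketch of that dot product for a single output location (3-channel 3x3 kernel applied to a 3-channel 3x3 patch):

    import numpy as np

    patch = np.random.rand(3, 3, 3)   # image patch: (channels, height, width)
    kernel = np.random.rand(3, 3, 3)  # one 3-channel kernel
    bias = 0.1                        # illustrative bias value

    out_value = np.sum(patch * kernel) + bias  # element-wise multiply, then sum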
Backprop and auto diff allows us to optimize ___ composed of ____
Backprop and auto diff allows us to optimize any function composed of differentiable blocks