Quiz 2 Flashcards
Receptive fields
Each node only receives input from a K1 × K2 window (image patch).
The region from which a node receives its input is called its receptive field.
Shared Weights
Nodes in different locations can share features.
Uses the same weights/parameters in the computation graph.
- Reduce parameters to (K1 × K2 + 1)
- Explicitly maintain spatial information
Learning Many Features
Weights are not shared across different feature extractors.
Reduce parameters to (K1 × K2 + 1) · M, where M is the number of features to be learned
Convolution
In mathematics, a convolution is an operation on two functions f and g that produces a third function, typically viewed as a modified version of one of the originals; it gives the area of overlap between the two functions as a function of the amount by which one of them is translated.
T or F: Convolutions are linear operations
True
What are CNN hyperparameters
- in_channels(int): Number of channels in the input image
- out_channels(int): Number of channels produced by the convolution
- kernel_size (int; tuple): Size of the convolving kernel
- stride (int;tuple;optional): denotes the size of the stride used by the convolution (default is 1)
- padding (int;tuple;optional): Zero padding added to both sides of the input (default is 0)
- padding_mode (string): 'zeros', 'reflect', 'replicate', or 'circular' (default: 'zeros')
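A minimal PyTorch sketch of these hyperparameters in use (the layer sizes here are arbitrary examples, not from the course):

```python
import torch
import torch.nn as nn

# Example only: 3 input channels -> 16 feature maps, 3x3 kernel,
# stride 1, one pixel of zero padding.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3,
                 stride=1, padding=1, padding_mode="zeros")

x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
y = conv(x)
print(y.shape)                  # torch.Size([1, 16, 32, 32]) -- spatial size preserved
```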
Output size formula for a vanilla convolution operation
Per spatial dimension:
output_dim = (N - F + 2P) / S + 1
N: Input dimension
F: Filter dimension
P: Padding
S: Stride
+1: accounts for the filter's starting position (it is not a bias term)
The full output volume is output_dim_1 × output_dim_2 × N_filters, with one output channel per filter.
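A small helper that applies the formula along one spatial dimension (the function name and the sizes are made up for illustration):

```python
def conv_output_dim(n, f, p=0, s=1):
    """Output size along one spatial dimension: (N - F + 2P) / S + 1."""
    return (n - f + 2 * p) // s + 1

print(conv_output_dim(32, 5))        # 28: a 5x5 filter shrinks a 32x32 input to 28x28
print(conv_output_dim(32, 5, p=2))   # 32: padding of 2 gives a "same"-sized output
```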
'Valid' convolution
The kernel is applied only where it fits entirely inside the image (no padding).
T or F: Larger the filter the smaller the shrinkage
False. Larger filter = larger shrinkage.
'Same' convolution
zero-padding the image borders to produce an output the same size as the raw input
CNN: Max pooling
For each window, calculate its max.
Pros: No parameters to learn.
CNN: Stride
The number of pixels the filter moves between successive applications.
CNN: Pooling layer
Make the representations smaller and more manageable through downsampling.
Only pools width and height, not depth
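A short PyTorch sketch (example sizes assumed) showing that pooling shrinks width/height but leaves the channel (depth) dimension alone, and has no learnable parameters:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # no parameters to learn

x = torch.randn(1, 16, 32, 32)   # (batch, channels, height, width)
y = pool(x)
print(y.shape)                   # torch.Size([1, 16, 16, 16]) -- depth (16 channels) unchanged
```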
CNN: Cross-correlation
Takes the dot product of a small filter (also called a kernel or weights) and an overlapping region of the input image or feature map.
Does not flip the kernel (unlike a true convolution).
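A toy NumPy version of 2-D cross-correlation (assumes a square image and kernel; names are made up), showing the dot product at each position with no kernel flip:

```python
import numpy as np

def cross_correlate_2d(image, kernel):
    """Slide the kernel over the image WITHOUT flipping it, taking a dot
    product at every position where it fits ('valid' size: m - k + 1)."""
    m, k = image.shape[0], kernel.shape[0]
    out = np.zeros((m - k + 1, m - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

print(cross_correlate_2d(np.ones((5, 5)), np.ones((3, 3))))   # 3x3 array of 9s
```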
CNN: T or F - Using a stride greater than 1 results in loss of information.
True. Stride > 1 implies jumping over some pixels.
CNN: Output size of vanilla/valid convolution vs full convolution
Vanilla: m-k+1
Full: m+k-1
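A quick 1-D NumPy check of the two output sizes (example lengths assumed):

```python
import numpy as np

m, k = 10, 3
signal, kernel = np.random.rand(m), np.random.rand(k)

print(np.convolve(signal, kernel, mode="valid").shape)   # (8,)  -> m - k + 1
print(np.convolve(signal, kernel, mode="full").shape)    # (12,) -> m + k - 1
```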
CNN: Benefit of pooling
Makes the representation invariant to small changes in the input
CNN: Full convolution
Enough zeros are added to the borders so that every pixel is visited k times in each direction. This results in an image of size m+k-1.
Full = Bigger size than original
Sigmoid
Min=0; max=1
Output is always positive
Saturates at both ends
Gradients vanish at both ends (as the output converges to 0 or 1, the gradient approaches zero)
Gradients are always positive
Computational complexity is high due to the exponential term
tanh
Min = -1; Max = 1; zero-centered
Saturates at both ends (-1, 1)
Gradients: vanish at both ends; always positive
Medium complexity, as tanh is not as simple as, say, multiplication
ReLU
Min = 0; Max = ∞; always positive
Not saturated on the positive side
Gradients: 0 when x <= 0 (aka dead ReLU); constant otherwise (doesn't vanish, which is good)
Cheap: computation doesn't come much easier than the max function
T or F: ReLU is differentiable
Technically no: it is not differentiable at zero (but it is differentiable everywhere else).
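A small NumPy sketch of the three activations' gradients, illustrating saturation at large |x| for sigmoid/tanh versus the constant positive-side gradient of ReLU:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

sig_grad = sigmoid(x) * (1 - sigmoid(x))   # ~0 at both ends (saturation), always positive
tanh_grad = 1 - np.tanh(x) ** 2            # ~0 at both ends (saturation), always positive
relu_grad = (x > 0).astype(float)          # 0 for x <= 0 ("dead"), constant 1 otherwise

print(sig_grad, tanh_grad, relu_grad, sep="\n")
```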
Initialization: What happens if you initialize close to a bad local minima
Poor gradient flow
Initialization: What happens if you initialize with large activations
Reach saturation quickly
Initialization: What happens if you initialize with small activations
Activations stay in the linear regime of the nonlinearity (or close to it), so you will have a strong gradient to learn from.
Initialization: What happens if you initialize all weights with a constant
All nodes learn the same thing (the symmetry is never broken).
Initialization: Common practice
1) Random sample from a small normal distribution, N(μ = 0, σ = 0.01)
2) Random sample from uniform distribution
Initialization: Why are equal (in terms of sampling from distribution), small weights preferred
There is no a priori reason why some weights should be larger than others.
Initialization: T or F - Deeper networks are less sensitive to initialization
False. Deeper networks are more sensitive to initialization because activations get progressively smaller with depth.
Initialization: Fan-in Fan-out rule
Maintain the variance at the output to be similar to that of the input. Keeps each layerβs variance the same.
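A NumPy sketch of one common form of this rule (Xavier/Glorot-style scaling; exact variants differ, and the sizes here are arbitrary):

```python
import numpy as np

fan_in, fan_out = 256, 128

# Scale the weights so the output variance stays on the same order as the input variance.
W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / (fan_in + fan_out))

x = np.random.randn(1000, fan_in)   # roughly unit-variance inputs
print(x.var(), (x @ W).var())       # the two variances stay on a similar scale
```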
Optimization: Issues that hinder optimization
Noisy gradient estimates (due to taking MiniBatches)
Saddle points
Ill-conditioned loss surface, where the curvature is high in one direction but not in others
Optimization: Loss surfaces that can cause problems
Local minima
plateaus
saddle points, a point that is a min in one axis but a max in another
Optimization: Momentum
Overcomes plateaus by adding an exponential moving average of past gradients to the update. Helps move off of areas with low gradients.
Optimization: Nesterov Momentum
Calculates gradient AFTER applying momentum term
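A sketch of both updates on a toy quadratic loss (generic function names; the only difference is where the gradient is evaluated):

```python
import numpy as np

def momentum_step(w, v, grad_fn, lr=0.01, beta=0.9):
    """Classical momentum: v is an exponentially decaying sum of past gradients."""
    v = beta * v - lr * grad_fn(w)
    return w + v, v

def nesterov_step(w, v, grad_fn, lr=0.01, beta=0.9):
    """Nesterov momentum: the gradient is computed AFTER applying the momentum term."""
    v = beta * v - lr * grad_fn(w + beta * v)
    return w + v, v

grad_fn = lambda w: w            # gradient of the toy loss L(w) = 0.5 * ||w||^2
w, v = np.ones(3), np.zeros(3)
for _ in range(200):
    w, v = nesterov_step(w, v, grad_fn)
print(w)                         # close to the minimum at 0
```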
Optimization: How to use Hessian
Use 2nd order derivatives to get information about the loss surface
Optimization: Condition number
The ratio of the largest to the smallest eigenvalue of the Hessian
Tells us how different the curvature is along different dimensions
Optimization: General idea of techniques like Adam
Apply per-parameter learning rates
Optimization: Adagrad
Adapts the learning rate for each parameter based on the historical gradients
Parameters with larger gradients get a rapid decrease in learning rate, while those with small gradients get a slower decrease.
Pro: Works well in a gently sloped parameter space.
Con: Can prematurely make learning rate too small.
Optimization: RMSProp
Like AdaGrad, but replaces the accumulated sum of squared gradients with an exponential moving average.
Optimization: Adam
Like Adagrad / RMSProp but includes momentum terms
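A condensed sketch of the three per-parameter update rules (simplified; generic names, Adam's bias correction included):

```python
import numpy as np

lr, eps = 0.01, 1e-8

def adagrad_step(w, g, G):
    G = G + g ** 2                          # running SUM of squared gradients (only grows)
    return w - lr * g / (np.sqrt(G) + eps), G

def rmsprop_step(w, g, G, rho=0.9):
    G = rho * G + (1 - rho) * g ** 2        # exponential moving average instead of a sum
    return w - lr * g / (np.sqrt(G) + eps), G

def adam_step(w, g, m, v, t, beta1=0.9, beta2=0.999):
    m = beta1 * m + (1 - beta1) * g         # momentum term (first moment)
    v = beta2 * v + (1 - beta2) * g ** 2    # RMSProp-style term (second moment)
    m_hat = m / (1 - beta1 ** t)            # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```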
Regularization: L1 norm
Adds a sign function of the weights to the update, in addition to the loss gradient. Results in sparse parameters.
w^j = w^{j-1} - α ( λ sign(w^{j-1}) + ∇L(w^{j-1}) )
Regularization: L2 norm
Applies a regularization term to the weights at each update
w^j = w^{j-1} - α ( λ w^{j-1} + ∇L(w^{j-1}) )
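A side-by-side sketch of the two updates (generic names; lam is the regularization strength):

```python
import numpy as np

def l2_update(w, grad_loss, lr=0.01, lam=1e-4):
    # Weight decay: pull each weight toward zero in proportion to its value.
    return w - lr * (lam * w + grad_loss)

def l1_update(w, grad_loss, lr=0.01, lam=1e-4):
    # Constant-magnitude pull toward zero via sign(w); tends to drive weights to exactly 0 (sparsity).
    return w - lr * (lam * np.sign(w) + grad_loss)
```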
Regularization: Dropout
Dropout is a technique in which a set of activations/parameters is randomly masked (i.e., multiplied by a matrix of 1s and 0s) so that a subset of parameters does not participate in learning on that pass.
Makes the model less reliant on any individual, highly effective parameters
Regularization: What needs to be done with dropout during inference
Either:
1) Scale the outputs or weights by the keep probability p at test time: W_test = W * p, or
2) Scale by 1/p during training (inverted dropout), so nothing needs to change at inference.
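A NumPy sketch of option 2 (inverted dropout), where the 1/p scaling is done at training time so inference is a no-op:

```python
import numpy as np

p = 0.5   # probability of KEEPING a unit

def dropout_forward(x, train=True):
    if train:
        mask = (np.random.rand(*x.shape) < p) / p   # drop units, then rescale by 1/p
        return x * mask
    return x                                        # inference: nothing to do

x = np.random.randn(4, 8)
print(dropout_forward(x, train=True).mean(), dropout_forward(x, train=False).mean())
```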
Batch norm: Diff between batch vs layer norm
Batch: Normalizes activations along the batch dimension
Layer: Normalizes activations along the feature (channel) dimension for each data point in the mini-batch
Batch Norm: Definition
Normalizes the activations of a layer across a mini-batch of data, which helps stabilize and speed up training by reducing internal covariate shift.
Batch Norm: How to use batch norm during inference
Use the running mean/variance averages accumulated during training.
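A minimal NumPy sketch of both modes (simplified; generic names, learnable gamma/beta included):

```python
import numpy as np

eps, momentum = 1e-5, 0.1

def batchnorm_forward(x, gamma, beta, running_mean, running_var, train=True):
    if train:
        mu, var = x.mean(axis=0), x.var(axis=0)                       # per-feature batch statistics
        running_mean = (1 - momentum) * running_mean + momentum * mu  # saved for inference
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        mu, var = running_mean, running_var                           # training-time running averages
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var

x = np.random.randn(32, 4)
y, rm, rv = batchnorm_forward(x, np.ones(4), np.zeros(4), np.zeros(4), np.ones(4), train=True)
print(y.mean(axis=0), y.var(axis=0))   # ~0 and ~1 per feature
```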
Batch norm: Pros
Improves gradient flow
Allows higher learning rates
Reduces dependence on initialization
Differentiable
Batch norm: Cons
Sufficient batch sizes must be used to get stable per-batch mean/variance.
Batch norm: Where to apply in network
Right before activation
Batch norm: T or F - Batch norm is useful for linear networks
False. This is because we have normalized out the first- and second-order statistics, which is all that a linear network can influence
Batch norm: T or F - Batch norm is useful for deep networks
True. In a deep neural network with nonlinear activation functions, the lower layers can perform nonlinear transformations of the data, so they remain useful.
Batch norm: T or F - The bias term should be omitted during batch norm
True. The bias term should be omitted because it becomes redundant with the Ξ² parameter applied by the batch normalization reparametrization.
Batch norm: T or F - For CNNs, you should use different mean/variance parameter values for each layer
False. It is important to apply the same normalizing ΞΌ and Ο at every spatial location within a feature map, so that the statistics of the feature map remain the same regardless of spatial location.
Initialization: Variance of ReLU
N(0, 1) × sqrt(2 / n_j), where n_j is the fan-in of layer j (He initialization for ReLU)
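In NumPy (sizes assumed), this scaling is one line:

```python
import numpy as np

fan_in, fan_out = 256, 128
W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)   # He initialization for ReLU layers
```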
Batch: Batch gradient descent
Network goes through the forward/backward pass using the entire dataset.
Pro: Provides deterministic updates.
Con: Computationally expensive.
Batch: Stochastic gradient descent
Network goes through f/b using only one example.
More noisy updates, but can have regularization effect by pushing model out of local minima.
Doesn't take advantage of parallelization
Batch: Mini-batch gradient descent
Takes N samples from the empirical distribution and runs f/b pass.
Benefits from parallel processing.
More stable than SGD.
Supervised pretraining
Model is initially trained on a related task that has labeled data (supervised learning) before fine-tuning it on the target task of interest.
Knowledge gained during the initial training phase can serve as a useful starting point for the model when tackling the target task.
Formula to calculate number of parameters and bias
K (F1 * F2 * D + 1)
K: Number of kernels
F: Filter dimensions
D: Number of input channels
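A quick PyTorch check of the formula (example sizes assumed):

```python
import torch.nn as nn

K, F1, F2, D = 16, 3, 3, 8   # 16 kernels of size 3x3 over 8 input channels
conv = nn.Conv2d(in_channels=D, out_channels=K, kernel_size=(F1, F2))

n_params = sum(p.numel() for p in conv.parameters())
print(n_params, K * (F1 * F2 * D + 1))   # both: 1168 (K*F1*F2*D weights plus one bias per kernel)
```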