Quiz 3 Flashcards
As you add more convolution + pooling layers, what does each pixel represent?
Each pixel of a deep layer represents a larger receptive field from a previous layer/input.
ImageNet
1.2 million images, 1000 classes.
Type of errors: Optimization error
Failure to find good weights for the model, even though good weights exist
(Bad optimization algorithm)
Type of errors: Estimation error
Training error is minimized, but the model doesn’t generalize to the test set.
(Overfitting, learning features that don’t generalize well)
Type of errors: Modeling error
Given a model that is too simple, no set of weights can model the real-world task.
Type of errors: Case study of multi-class logistic regression (MCLR) vs AlexNet
Which has high modeling error?
MCLR has high modeling error because the model is very simple. It can’t capture the complexity of the real world.
Type of errors: Case study of multi-class logistic regression vs AlexNet
What kind of errors would AlexNet have, and why?
AlexNet may have smaller modeling error than MCLR, but a similar degree of estimation error could still occur.
Possibly higher optimization error because a complex architecture is harder to optimize.
Key idea of transfer learning
Reuse features learned on large dataset to learn new things
Describe transfer learning in 3 steps
- Train on large-scale dataset (may be provided for you)
- Take custom data and initialize the network with weights trained in step 1
- Continue to train on new dataset.
Limitations of transfer learning
Won’t work well if the target task is very different (e.g. using a model pretrained on natural images to classify sketches)
Benefit of transfer learning
Significantly reduces amount of labeled data needed to accomplish a task
Using a larger capacity model will always reduce estimation error
False. A larger-capacity model can overfit more easily, so estimation error may increase, especially without regularization.
Transfer learning: Example of what network changes you may need to make from a pretrained model to your own
Replace the last layer with a new fully connected layer that has one output node per new category
Transfer learning: Ways to train from pretrained model’s weights
- Update all parameters
- Freeze parts of the network (e.g. only tune fully connected layers)
Transfer learning: Why would you want to “freeze” parts of your network
Reduces the number of parameters that you need to learn from your new data set.
(If you don’t have enough data, you may not be able to fine-tune all the features in your network)
Transfer learning: T/F - If you have a large data set for a target domain, starting from pretrained weights mainly results in faster convergence (rather than higher final accuracy)
True
Transfer learning: Explain the three data regimes with respect to data set size and generalization error
- Small data region - not enough data, hard to reduce error
- Power-law region - error falls as a power law in training data size (a straight line on a log-log plot)
- Irreducible error region - useful data saturated to point of irreducible error
Modern networks: What was the key innovation introduced by AlexNet that made it a breakthrough in deep learning?
ReLU activation
Modern networks: Which one of these architectures is known for its simplicity with a focus on using only 3x3 convolutional filters?
VGGNet used 3x3 convolutional filters exclusively
Modern networks: Which architecture introduced the concept of residual learning, addressing the vanishing gradient problem and allowing the training of very deep networks?
ResNet introduced the concept of residual learning, where shortcut connections (or skip connections) were added to the network, allowing the gradient to flow more directly during training, thus addressing the vanishing gradient problem.
Modern networks: Which architecture uses inception modules? Explain what they are
InceptionNet.
Uses multiple filter sizes in parallel to capture different features
Modern networks: Which architecture was known for removing FC layers at the end of the network? What did it replace it with?
ResNet
Used global average pooling instead of FC layers. Global average pooling reduces overfitting and the total number of parameters in the network.
CNN: During forward propagation in a convolutional layer, what operation(s) is performed between the input and the kernel?
element-wise multiplication and summation
CNN: What is the purpose of backpropagation in the context of convolutional layers?
To compute the gradients for the kernel/filter
CNN: During backpropagation in a convolutional layer, what operation is performed to compute the gradients for the kernel?
Element-wise multiplication between the gradients of the loss wrt the output and the corresponding input values, then summed.
CNN: What is the purpose of padding in a CNN?
To preserve spatial dimensions. Otherwise the feature maps of deeper layers become smaller and smaller.
CNN: Valid padding vs same padding
Valid: No padding, window always within input image
Same: Padding added to keep output size equal to input
CNN: Why use max-pooling
Reduces spatial dimensions through downsampling. Adds invariance to translation of features.
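A minimal sketch of the idea, assuming 2x2 windows with stride 2 and a list-of-lists image with even height and width (the function name `max_pool_2x2` and the toy images are illustrative only). It shows both effects from the card: the output is half the size, and a one-pixel shift of a feature inside a pooling window leaves the output unchanged.

```python
def max_pool_2x2(image):
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    pooled = []
    for i in range(0, len(image), 2):
        row = []
        for j in range(0, len(image[0]), 2):
            window = [
                image[i][j], image[i][j + 1],
                image[i + 1][j], image[i + 1][j + 1],
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# A single bright feature at position (1, 1).
image = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# The same feature shifted by one pixel to (0, 0), still in the same window.
shifted = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

pooled = max_pool_2x2(image)            # 4x4 input becomes 2x2 output
pooled_shifted = max_pool_2x2(shifted)  # identical despite the shift
```

The invariance only holds for shifts that stay inside one pooling window; a larger shift moves the feature to a different output cell.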
CNN: Invariance
Property where a model is robust to certain transformations in the input.
Practically, this explains how a CNN may be able to classify an object in an image regardless of where in the image it is located.
CNN: Equivariance
Property where a model can maintain the relationship between different elements after a transformation occurs (e.g. scaling, rotation, time shift)
CNN: How is invariance achieved by CNNs
Shared weights and biases (the same kernel is applied at every location), combined with pooling
CNN: Equivariance
Convolution layers maintain spatial relationships between features.
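E.g. if an object shifts in the image, its response shifts by the same amount in the feature map. (Standard convolutions are equivariant to translation, not to rotation.)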
CNN: CNN vs FC - which has higher memory usage
CNN
CNN: CNN vs FC - which has more parameters
FC
CNN: How to calculate gradient of a kernel during backwards pass
Multiply downstream gradient elements into corresponding receptive field. Then add all the receptive fields together.
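The rule above can be sketched in plain Python, assuming stride 1, no padding, a single channel, and the cross-correlation convention common in deep learning libraries. The name `conv_kernel_grad` and the example values are made up for illustration.

```python
def conv_kernel_grad(x, d_out, k):
    """Gradient of the loss wrt a k x k kernel (stride 1, no padding).

    x     : input as a list of lists
    d_out : gradient of the loss wrt the layer output (same shape as the output)
    """
    d_kernel = [[0.0] * k for _ in range(k)]
    for i in range(len(d_out)):
        for j in range(len(d_out[0])):
            # Output pixel (i, j) saw the receptive field x[i:i+k][j:j+k].
            # Scale that field by the output gradient and accumulate.
            for a in range(k):
                for b in range(k):
                    d_kernel[a][b] += d_out[i][j] * x[i + a][j + b]
    return d_kernel


x = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
d_out = [
    [1, 1],
    [1, 1],
]
d_kernel = conv_kernel_grad(x, d_out, 2)
```

With an all-ones output gradient, each kernel gradient entry is just the sum of the input values that the weight touched during the forward pass.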
CNN: Given a 3x3 kernel, the top-left cell’s kernel weight affects all pixels in the output feature map
True
CNN: T/F - Given a constant kernel size, adding more layers increases the receptive field exponentially.
False. Adding more convolutional layers increases the receptive field size linearly: with stride 1, each extra layer increases the receptive field by (kernel size - 1).
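A small sketch of the receptive field calculation, assuming each layer is described by a hypothetical `(kernel_size, stride)` pair. With stride 1, each layer adds `kernel_size - 1` pixels, so growth is linear in depth; strides greater than 1 make later layers count for more.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output pixel.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1    # a single output pixel sees one pixel before any layer is applied
    jump = 1  # distance in input pixels between adjacent pixels of the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Stacking 1 to 4 stride-1 3x3 convolutions: growth is linear in depth.
sizes = [receptive_field([(3, 1)] * n) for n in range(1, 5)]

# Three 3x3 convolutions with stride 2 grow the field much faster.
strided = receptive_field([(3, 2)] * 3)
```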
How to visualize FC layer
Reshape weights for a node back into size of image
How to visualize CNN layer
For each kernel, scale its values to 0-255 and display it as a small image.
t-SNE
Performs a non-linear mapping of high-dimensional data to 2D space. Tries to preserve pairwise similarities, so points that are neighbors in the original space stay close together.
What can a visualization output (aka activation/filter) map show with respect to the input?
Given an input image and a convolution kernel in the network, we can view which area of the image produced the highest activation for that kernel.
Why can visualization interpretability be difficult?
- No intrinsic measure of utility. Need user studies to measure usefulness of visualization.
- Neural networks learn “distributed representation” - 1:1 mapping of node to feature not guaranteed.
Gradient ascent
Updates the input in the direction of the gradient (rather than opposite in gradient descent)
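A minimal sketch of the update rule on a toy one-dimensional problem. The score function f(x) = -(x - 3)^2, the step size, and the name `gradient_ascent` are assumptions chosen only to illustrate the sign of the update; in a network, `x` would be the input image and the gradient would come from backprop.

```python
def gradient_ascent(grad_fn, x, step_size=0.1, steps=100):
    """Repeatedly move x in the direction of the gradient to increase the score."""
    for _ in range(steps):
        x = x + step_size * grad_fn(x)  # gradient descent would subtract instead
    return x


# Toy score f(x) = -(x - 3)^2 has gradient -2 * (x - 3) and its maximum at x = 3.
x_star = gradient_ascent(lambda x: -2.0 * (x - 3.0), x=0.0)
```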
Guided backprop
In the backward pass through each ReLU, zeroes the gradient where the forward input was negative (as in standard backprop) and additionally zeroes out negative gradients.
Improves visualization by only keeping positive gradients.
Saliency map
Visualizes area of the image with high gradients
How to use saliency map for bias
See which area of the image the network focused on (e.g. a wolf classifier that relies on the snow in the background rather than on the animal)
Grad-CAM
Generates heat maps highlighting regions of an input image that contribute the most to a specific class prediction
Grad-CAM - How does it work?
Computes the gradient of the target class score with respect to the feature maps of the last convolutional layer. Averages the gradients over each channel to get one weight per channel, takes the weighted sum of the feature maps, and applies ReLU.
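The computation can be sketched in plain Python, assuming the feature maps and their gradients have already been extracted from the network as nested lists. The function name `grad_cam` and the toy numbers are illustrative only; real use would also upsample the heat map to the input resolution.

```python
def grad_cam(feature_maps, gradients):
    """Coarse Grad-CAM heat map from K feature maps and their gradients.

    feature_maps, gradients: lists of K channels, each an H x W list of lists.
    gradients[k] is the gradient of the class score wrt feature_maps[k].
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    # One weight per channel: the spatial average of that channel's gradient.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    heatmap = [[0.0] * width for _ in range(height)]
    for w, fmap in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += w * fmap[i][j]
    # ReLU keeps only locations with a positive influence on the class score.
    return [[max(0.0, v) for v in row] for row in heatmap]


feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],  # channel 0 fires top-left
    [[0.0, 0.0], [0.0, 1.0]],  # channel 1 fires bottom-right
]
gradients = [
    [[2.0, 2.0], [2.0, 2.0]],      # channel 0 supports the class (weight 2)
    [[-1.0, -1.0], [-1.0, -1.0]],  # channel 1 argues against it (weight -1)
]
heatmap = grad_cam(feature_maps, gradients)
```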
Difference between Grad-CAM and Guided Grad-CAM
Guided Grad-CAM multiplies the guided backprop map element-wise with the (upsampled) Grad-CAM heat map.
One practical use of gradient ascent
Class visualization
White-box attacks
Attacker has complete picture of the target model (network, params, data)
Black-box attacks
Attacker has limited or no picture of the target model.
Generally uses trial-and-error attempts to craft adversarial examples.
Key idea from Geirhos, “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness”
CNNs tend to be more biased towards texture than shape. Remediating this bias improves accuracy and robustness.
Losses in style transfer
Style-loss function - minimize squared diff between gram matrices
Content-loss function - match features of the content image and the generated image
Gram matrix
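Square matrix of inner products between vectors. In style transfer, entry (i, j) is the inner product of the flattened feature maps of channels i and j, so it captures which features co-occur.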
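A small sketch of a Gram matrix and the squared-difference style loss, assuming each channel's feature map has already been flattened to a list of H*W activations. The names `gram_matrix` and `style_loss` and the toy features are hypothetical, and no normalization constant is applied.

```python
def gram_matrix(features):
    """Gram matrix of C flattened feature maps: entry (i, j) = <F_i, F_j>."""
    return [
        [sum(a * b for a, b in zip(f_i, f_j)) for f_j in features]
        for f_i in features
    ]


def style_loss(gram_a, gram_b):
    """Sum of squared differences between two Gram matrices (unnormalized)."""
    return sum(
        (a - b) ** 2
        for row_a, row_b in zip(gram_a, gram_b)
        for a, b in zip(row_a, row_b)
    )


features = [[1.0, 2.0], [3.0, 4.0]]  # 2 channels, 2 spatial positions each
gram = gram_matrix(features)
loss_same = style_loss(gram, gram)   # identical styles give zero loss
```

Because the spatial positions are summed out, the Gram matrix keeps which channels fire together but discards where they fire, which is why it works as a style descriptor.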
Difference of segmentation networks vs classification
Predicts classes for each pixel.
Encoder-decoder CNN architecture - key idea
The decoder mirrors the encoder. It takes the small feature maps and upsamples them back to the original image size.
Max unpooling
Puts the max output value back into its original position in the receptive field when decoding. Non-max pixels are left as zero.
What do max unpooling and deconvolution do with overlapping windows
Sums them
Deconvolution (transposed convolution)
Each pixel in the input is multiplied by all of the kernel values, and the result is “stamped” onto the output at the corresponding position.
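The stamping picture can be sketched as follows, assuming a single channel, no padding, and a hypothetical helper named `transposed_conv2d`. Overlapping stamps are summed, as the previous card states.

```python
def transposed_conv2d(x, kernel, stride=1):
    """Transposed convolution: stamp a scaled copy of the kernel per input pixel."""
    in_h, in_w = len(x), len(x[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (in_h - 1) * stride + k_h
    out_w = (in_w - 1) * stride + k_w
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(in_h):
        for j in range(in_w):
            for a in range(k_h):
                for b in range(k_w):
                    # Overlapping stamps accumulate by summation.
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out


x = [
    [1, 2],
    [3, 4],
]
kernel = [
    [1, 1],
    [1, 1],
]
upsampled = transposed_conv2d(x, kernel, stride=1)   # 2x2 input becomes 3x3
no_overlap = transposed_conv2d(x, kernel, stride=2)  # 2x2 input becomes 4x4
```

With stride 1 the center output pixel receives all four stamps; with stride 2 and a 2x2 kernel the stamps tile the output without overlapping.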
U-net
Uses skip connections like ResNet, but in an encoder-decoder network: encoder feature maps are passed across to the decoder layer of matching resolution.
Single-stage object detection
Task of identifying and setting a bounding box for an identified object.
Single-stage object detection - what are its losses?
Cross-entropy loss for classification + Mean squared error for bounding box
Multi-headed architecture
When an architecture performs several tasks with shared features.
Single-shot detector (SSD) - key idea
Uses a grid and, for each grid cell, predicts K bounding boxes. Estimates refined boxes across multiple layers. Selects the box with the highest confidence score among a group of overlapping boxes for an object.
YOLO - what makes it unique
Predict bounding box + classification in a single pass.
Special loss function to minimize both errors at once.
Mean average precision in the context of bounding boxes
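Compute intersection over union (IoU): the area where the predicted and ground-truth boxes overlap, divided by the area of their union, as a measure of fit. Count a prediction as correct when its IoU exceeds a threshold, build the precision/recall curve for each class, take its average precision, and average over all classes.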
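A minimal sketch of the intersection-over-union step, assuming axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` tuples. The box format and the function name `iou` are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes get zero intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


overlap_score = iou((0, 0, 2, 2), (1, 1, 3, 3))   # intersection 1, union 7
disjoint_score = iou((0, 0, 1, 1), (2, 2, 3, 3))  # no overlap
perfect_score = iou((0, 0, 2, 2), (0, 0, 2, 2))   # identical boxes
```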
Two-stage object detection
Step 1 - determine regions of interest
Step 2 - classify those regions
One way two-stage object detectors find candidate regions (but it is slow)
Unsupervised (non-learned) region proposal, e.g. selective search
Fast R-CNN - key idea
Run the CNN once on the whole image, then project each region of interest from the input image onto the shared feature map.
Fast R-CNN - what is its benefit
Reuses computation
ROI Pooling - key idea
Applies a fixed grid to the feature map and applies max pooling to each cell in the grid with respect to the corresponding feature map.
ROI Pooling - what is its benefit
It is differentiable, so gradients can backpropagate through it and the network can be trained end to end
Faster R-CNN - key idea
Uses a region proposal network (RPN) to generate candidate regions. Take top-K and classify.
Mask R-CNN - Key Idea
Predicts a mask for each detected box to determine which pixels belong to the object
Given an input image
1 2 3
4 5 6
7 8 9
and filter:
1 0
0 -1
Compute the forward operation
For the top-left element of the output:
(1*1) + (2*0) + (4*0) + (5*(-1))
Result: -4
For the top-right element of the output:
(2*1) + (3*0) + (5*0) + (6*(-1))
Result: -4
For the bottom-left element of the output:
(4*1) + (5*0) + (7*0) + (8*(-1))
Result: -4
For the bottom-right element of the output:
(5*1) + (6*0) + (8*0) + (9*(-1))
Result: -4
The resulting 2x2 output matrix:
-4 -4
-4 -4
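The forward operation for this image and filter can be checked with a short sketch, assuming stride 1, no padding, and the cross-correlation convention (the kernel is not flipped), which is what the element-by-element products on this card describe. The helper name `cross_correlate2d` is illustrative.

```python
def cross_correlate2d(image, kernel, stride=1):
    """Slide the kernel over the image; multiply element-wise and sum each window."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    return [
        [
            sum(
                image[i * stride + a][j * stride + b] * kernel[a][b]
                for a in range(k_h)
                for b in range(k_w)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0],
    [0, -1],
]
output = cross_correlate2d(image, kernel)
```

Each window contributes its top-left value minus its bottom-right value, and in this image those two always differ by 4.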
Given a gradient:
1 2 3
4 5 6
7 8 9
and filter:
1 0 -1
2 0 -2
1 0 -1
Compute gradient with respect to the filter for the top-left element of the gradient (dL/d(1))
dL/d(1) = (1*1) + (2*2) + (3*3) + (4*4) + (5*5) + (6*6) + (7*7) + (8*8) + (9*9)
= 285
Given a forward pass of stride 2 of a 4x4 input and a 2x2 kernel, what is:
The output shape
Output size = (4 - 2) / 2 + 1 = 2, so the output shape is 2x2
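The general form of this formula, with padding included, can be sketched as below. The function name `conv_output_size` is hypothetical, and integer division is assumed for cases where the window does not tile the input evenly.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output size along one spatial dimension of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# The card's case: 4x4 input, 2x2 kernel, stride 2, no padding.
size = conv_output_size(4, 2, stride=2)

# A 5x5 kernel with padding 2 and stride 1 keeps a 32x32 input at 32x32.
same_size = conv_output_size(32, 5, stride=1, padding=2)
```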
Given a forward pass of stride 2 of a 4x4 input and a 2x2 kernel, what is the shape of:
Downstream gradient
2x2 (same as output)
Given a forward pass of stride 2 of a 4x4 input and a 2x2 kernel, what is:
Gradient wrt kernel
2x2 (same shape as kernel)
Given a backward pass with stride 1 of a 4x4 input, 2x2 kernel and 2x2 gradient, what is the shape of:
Gradient wrt input
4x4 (same shape as the input)
Given:
Input - 32x32x3
Kernel: 5x5
Padding: 2
Stride: 1
Number of filters: 10
What is the parameter size?
760
Formula = (Channels * Kernel * Kernel + Bias) * Filters
= (3 * 5 * 5 +1) * 10
= 760
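The parameter formula above as a small function, assuming square kernels and one bias per filter. The name `conv_param_count` is made up for this sketch. Note that padding, stride, and the 32x32 spatial size do not affect the count.

```python
def conv_param_count(in_channels, kernel_size, num_filters, bias=True):
    """Number of learnable parameters in a convolutional layer."""
    per_filter = in_channels * kernel_size * kernel_size + (1 if bias else 0)
    return per_filter * num_filters


# The card's case: 3 input channels, 5x5 kernel, 10 filters.
params = conv_param_count(in_channels=3, kernel_size=5, num_filters=10)
params_no_bias = conv_param_count(3, 5, 10, bias=False)
```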
Given an input (28x28x3) what is the memory requirement
2352
Memory requirement is the number of stored values: height * width * channels = 28 * 28 * 3 = 2352