Quiz #3 Flashcards
What three partial derivatives must we calculate for backpropagation in a convolutional layer?
- dL/dh_in = dL/dh_out * dh_out/dh_in (i.e. the partial derivative of the loss w.r.t. the input from the previous layer. This is what gets passed back to the previous layer.)
- dL/dK = dL/dh_out * dh_out/dK (i.e. the partial derivative of the loss w.r.t. the kernel values)
- dL/dh_out (i.e. the partial derivative of the loss w.r.t. the output of the current layer. Remember that this is given because it is the “upstream gradient”)
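To make these gradients concrete, here is a minimal NumPy sketch of a single-channel, stride-1, no-padding convolutional layer (the function and variable names are my own, not from the lectures). Note how dL/dK and dL/dx each accumulate a contribution from every output pixel:

```python
import numpy as np

def conv2d_forward(x, k):
    """Naive valid cross-correlation: y[r, c] = sum_{a, b} x[r + a, c + b] * k[a, b]."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            y[r, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return y

def conv2d_backward(x, k, dL_dy):
    """Given the upstream gradient dL/dy, accumulate dL/dK and dL/dx."""
    dL_dk = np.zeros_like(k)
    dL_dx = np.zeros_like(x)
    kh, kw = k.shape
    for r in range(dL_dy.shape[0]):
        for c in range(dL_dy.shape[1]):
            # every output pixel contributes to every kernel weight...
            dL_dk += dL_dy[r, c] * x[r:r + kh, c:c + kw]
            # ...and to the input pixels the kernel touched at this output position
            dL_dx[r:r + kh, c:c + kw] += dL_dy[r, c] * k
    return dL_dk, dL_dx
```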
When calculating dL/dK, a kernel pixel does not affect all the values in the output? (True/False)
False, a kernel pixel impacts all the values of the output map. This is because we stride the kernel across the image and the weights are shared across every output position.
In a convolutional layer, when calculating the partial derivative of the loss w.r.t. the kernel (dL/dK), we must incorporate ALL the upstream gradients and apply the chain rule over all the output pixels? (True/False)
True. This is because a single kernel pixel impacts the entire output since the kernel is strided across the image and weights are shared.
If a node in a computation graph impacts multiple values in the output, what operation must be applied in the backward pass to ensure that information from each of those individual connections is incorporated in the backprop update?
We SUM the gradients from each of the upstream connections.
If we take the partial derivative of the output pixel located at (r, c) w.r.t. the kernel pixel located at (a’, b’), what expression represents the value of dY(r,c)/dK(a’,b’) if a’=b’=0?
dY(r,c)/dK(a’,b’) = x(r + a’, c + b’), so if a’=b’=0 then the derivative for this location is simply x(r, c)
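A quick finite-difference sanity check of this card (random data, valid cross-correlation as on the earlier sketch; the setup is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))   # input
k = rng.standard_normal((3, 3))   # kernel

def y_at(r, c, kernel):
    """Output pixel Y(r, c) of a valid cross-correlation: sum_{a, b} x(r + a, c + b) * K(a, b)."""
    return np.sum(x[r:r + 3, c:c + 3] * kernel)

r, c, a, b = 1, 2, 0, 0           # output location (r, c); kernel location (a', b') = (0, 0)
eps = 1e-6
k_pert = k.copy()
k_pert[a, b] += eps

finite_diff = (y_at(r, c, k_pert) - y_at(r, c, k)) / eps
print(finite_diff, x[r + a, c + b])   # both are (approximately) x(r, c), as the card states
```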
When calculating the partial derivatives for backpropagation in a convolutional layer, it is unnecessary to calculate the partial derivative of the loss L with respect to the input x (i.e. dL/dx) because that derivative does not impact the kernel weight value updates? (True/False).
False. While it’s true that dL/dx isn’t needed for updating the kernel values, this derivative is important because it is the gradient that gets passed back to the previous layer.
What gradient needs to be calculated in order to pass back to the previous layer?
dL/dx, i.e. the partial derivative of the loss w.r.t the input of the current layer.
For input pixel x(r’, c’), what impact does this pixel have on the output when calculating the gradient dL/dx?
It impacts only the output pixels in the neighborhood around it, i.e. the output positions where the kernel overlaps that input pixel.
When calculating the gradient of the loss w.r.t. the input x (dL/dx), every pixel in the output is impacted by a given input pixel? (True/False)
False. Since we stride the kernel across the input x, only the output pixels whose kernel window touches that input pixel are affected. When computing dL/dx we sum the gradients over those output positions, so each input pixel accumulates contributions only from its neighboring region.
When calculating the gradient for a max pooling layer, every input pixel into the max pool layer impacts the gradient? (True/False)
False. Max pooling performs dimensionality reduction by keeping only the maximum pixel within each kernel region. Since only that pixel contributes to the output, the upstream gradient is routed back to it alone; the gradients with respect to every other pixel in the region are zero.
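A minimal NumPy sketch of that gradient routing for a 2x2 max pool with stride 2 (single channel; names are my own):

```python
import numpy as np

def maxpool2x2_forward(x):
    """2x2 max pool, stride 2; also record which input pixel was the max in each window."""
    H, W = x.shape
    y = np.zeros((H // 2, W // 2))
    argmax = {}
    for r in range(0, H, 2):
        for c in range(0, W, 2):
            window = x[r:r + 2, c:c + 2]
            i, j = np.unravel_index(np.argmax(window), window.shape)
            y[r // 2, c // 2] = window[i, j]
            argmax[(r // 2, c // 2)] = (r + i, c + j)
    return y, argmax

def maxpool2x2_backward(dL_dy, argmax, x_shape):
    """Route each upstream gradient to the pixel that was the max; every other pixel gets zero."""
    dL_dx = np.zeros(x_shape)
    for (r, c), (i, j) in argmax.items():
        dL_dx[i, j] += dL_dy[r, c]
    return dL_dx
```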
A single pixel deep in a multi-layered CNN is only sensitive to the receptive field from the n-1 layer? (True/False)
False. A single pixel in the deeper layers is impacted by a larger receptive field from the previous layer, which in turn is influenced by a larger receptive field from the layer before it, and so on. This is what gives CNNs their representational power.
What was the first major 21st century CNN architecture and when was it introduced?
AlexNet in 2012
We tend to use fewer convolutional kernels (i.e. feature maps) as we go deeper into the network? (True/False)
False, generally speaking. The number of feature maps typically increases as we go deeper, while the spatial dimensions shrink.
What was the first modern CNN architecture to use ReLU instead of sigmoid or tanh?
AlexNet
What activation function is used in AlexNet?
ReLU (it was the first to do this)
What are the 5 key aspects of the AlexNet architecture (per the lectures)?
- ReLU instead of sigmoid or tanh
- Specialized normalization layers
- PCA-based data augmentation
- Dropout
- Ensembling (7 models were trained together)
As we go deeper into a CNN, the receptive field increases?
True
What layers use the most memory and why?
Convolutional layers. We have to store the activations from the forward pass because the gradient calculation requires them in the backward pass. Since the output of the forward pass is so large (we’re striding across the entire image, remember), this leads to a large memory footprint.
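A back-of-the-envelope illustration (a VGG-style first conv layer on a 224x224x3 input; the numbers are only there to show the activations-versus-parameters imbalance):

```python
# VGG-style first conv layer: 3x3 kernels, 3 input channels, 64 output channels,
# padding 1 so the 224x224 spatial size is preserved.
H, W, C_in, C_out, K = 224, 224, 3, 64, 3

activation_values = H * W * C_out       # output map kept around for the backward pass
weight_values = K * K * C_in * C_out    # kernel parameters

print(activation_values)   # 3211264 values per image
print(weight_values)       # 1728 parameters
```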
Convolutional layers tend to have more parameters than FC layers? (True/False)
False. Convolutional layers have a higher memory footprint, but FC layers have many more parameters since every input neuron is connected to every output neuron.
What layers tend to have the most parameters and why?
Fully connected layers. This is because (as implied by the name) every input is connected to every output, so the number of weights is the product of the input and output sizes.
For a fully connected layer with 12 input neurons, 10 output neurons and 3 channels, how many parameters are there (excluding bias terms)?
12 × 10 × 3 = 360
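A quick PyTorch sanity check (this assumes the 3 channels are flattened into the input, so the layer maps 12 × 3 = 36 inputs to 10 outputs):

```python
import torch.nn as nn

fc = nn.Linear(in_features=12 * 3, out_features=10, bias=False)
print(sum(p.numel() for p in fc.parameters()))   # 360 = 12 * 10 * 3
```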
What are the two key aspects of the VGG architecture?
- Repeated application of blocks:
- 3x3 conv (stride=1, padding=1)
- 2x2 max pool (stride=2)
- Very large number of parameters (mostly from big FC layers)
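A rough PyTorch sketch of that repeated block (the channel widths and conv counts below are illustrative, not the exact VGG configuration):

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """One VGG-style block: repeated 3x3 convs (stride 1, padding 1), then a 2x2 max pool (stride 2)."""
    layers = []
    for i in range(num_convs):
        layers.append(nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                kernel_size=3, stride=1, padding=1))
        layers.append(nn.ReLU(inplace=True))
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Stacking blocks: channels grow while the spatial size halves after each block
features = nn.Sequential(vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3))
```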
What are some of the main architectural differences between VGG and AlexNet?
- AlexNet used a large stride, but this loses information. VGG uses a much smaller stride (1 for conv layers, 2 for max pool) to preserve information.
Roughly how many trainable parameters are required for VGG architectures versus AlexNet?
Hundreds of millions for VGG compared to 60-70M for AlexNet
What are some of the key ideas used in the Inception architecture?
- Repeated blocks
- Multiscale features (i.e. concatenating convolutional features created using different kernel sizes and using the concatenated stack as the final output map).
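A simplified PyTorch sketch of the multiscale idea (the real Inception blocks also use 1x1 bottlenecks and a pooling branch, omitted here):

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel convs at different kernel sizes, concatenated along the channel dimension."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)

    def forward(self, x):
        # Every branch keeps the same spatial size, so the feature maps can be stacked
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
```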
What is one of the downsides of the Inception architecture?
The use of multiscale features means that if each block uses N different kernel sizes, we have to perform N convolutions instead of just one as in a normal architecture, which increases the computational cost.
What is the key idea of Residual Blocks?
- Help prevent issues with vanishing gradients
- Allow information from a layer to propagate to any future layer (forwards or backwards!)
They are useful because they improve gradient flow.
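A simplified PyTorch sketch of a residual block (no batch norm or downsampling, just to show the identity shortcut):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two convs plus a skip connection that adds the input back to the output."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # the identity path gives gradients a direct route back
```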
What is optimization error?
It is the idea that even if your NN can theoretically perfectly model the world, there’s no guarantee that your optimization algorithm can find an optimal set of weights that will achieve that level of performance.
What are the three types of error that pose a challenge to generalization?
- Optimization error
- Estimation error
- Modeling error
What is estimation error?
It is the idea that even if we find a set of weights that works well on the training set, there isn’t a guarantee that it will generalize to the test data. This could be because of overfitting, learning features that are good for the training set but don’t generalize to the test set, etc.
What is modeling error?
It is the idea that there may be a disconnect between how the world actually works (reality) versus what the model is actually capable of representing. This could be because of insufficient capacity of the model, or using a model that isn’t suited to the task (for example, trying to use simple multi-class logistic regression for semantic segmentation; there’s no set of weights that could reasonably manage that complexity with such a simple model).
In the context of transfer learning, when performing fine tuning we only update the parameters in the last layer? (True/False)
False. When fine-tuning, all the parameters are updated by training the pre-trained model on our smaller, domain-specific dataset.
In the context of transfer learning, when freezing the feature layers, only the weights in the final layer are updated during training? (True/False)
True (this is often done when there isn’t enough data to train from scratch).
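A minimal PyTorch sketch of the two options (torchvision ResNet-18 with ImageNet weights and a 10-class target task are just placeholders):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")

# Feature freezing: turn off gradients for all pre-trained layers...
for p in model.parameters():
    p.requires_grad = False

# ...and train only the newly added final layer on the target dataset.
model.fc = nn.Linear(model.fc.in_features, 10)

# Fine-tuning would instead leave requires_grad = True everywhere, so all
# parameters get updated on the smaller, domain-specific dataset.
```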
What are two reasons you might want to reconsider using transfer learning for some specific problem?
- If the source dataset you train on is very different from the target dataset
- If you have enough data for the target domain (if so, then probably the only benefit of using transfer learning will be faster convergence)
What are four visualization methods we can use to try to understand what a trained NN has learned?
- Weights
- Activations (output maps)
- Gradients
- Robustness to perturbation
Using dimensionality reduction, we can plot the activations of any layer (conv, linear, etc.) in 2D to try to understand the output space visually? (True/False)
True. PCA and t-SNE (most common) are frequently used to do this.
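A small sketch with scikit-learn (random numbers stand in for real layer activations; one row per example):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

activations = np.random.randn(500, 256)   # e.g. flattened outputs of some layer

coords_pca = PCA(n_components=2).fit_transform(activations)
coords_tsne = TSNE(n_components=2).fit_transform(activations)
# Both are (500, 2) arrays that can be scatter-plotted and colored by class label
```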
What is a Saliency Map and what is it useful for?
The idea behind a saliency map is that we can backprop through a network all the way back to the image (or any arbitrary point in the computation graph) and look at the sensitivity of the loss to individual pixel changes. Large sensitivity implies important pixels.
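A minimal PyTorch sketch of a saliency map (pre-trained ResNet-18 and a random tensor standing in for a preprocessed image):

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
image = torch.randn(1, 3, 224, 224, requires_grad=True)   # stand-in for a real image

scores = model(image)                 # pre-softmax class scores
idx = scores.argmax(dim=1).item()     # predicted class
scores[0, idx].backward()             # backprop that score all the way to the image pixels

saliency = image.grad.abs().max(dim=1)[0]   # per-pixel sensitivity, shape (1, 224, 224)
```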
When visualizing gradients of loss w.r.t. an input image, why do we use the gradient of the classifier scores BEFORE the softmax layer?
Because with the softmax, the loss can also be improved by “pushing down” the scores of the non-predicted classes, so the gradient w.r.t. the image mixes in effects unrelated to the class of interest; the pre-softmax score isolates it.
What is guided backprop used for?
Many areas of an input image might actually DECREASE the feature activations. This can make trying to visualize gradients difficult. Guided backprop zeros out the negative gradients so that we only see the POSITIVE contributions to the activation.
Why is optimizing the input image to GENERATE examples to increase class scores or activations useful, and how do we do this in practice?
It can be used to aid interpretability. Specifically, it can visually show us a great deal about what examples (not in the training set) are able to activate the network. We can do this by performing gradient ascent instead of descent.
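A rough sketch of that gradient ascent loop (the target class, step size, and L2 regularizer are arbitrary choices for illustration):

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)            # only the image is optimized, not the weights

target_class = 130                     # arbitrary class index
img = torch.zeros(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.SGD([img], lr=1.0)

for _ in range(100):
    optimizer.zero_grad()
    score = model(img)[0, target_class]
    loss = -score + 1e-3 * img.norm()  # minimizing -score = gradient ASCENT on the class score
    loss.backward()
    optimizer.step()
```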