Quiz 3 - CNN Architecture, Visualization, Advanced CV Architecture Flashcards
T/F: Visualization makes assessing interpretability easy
False
- Visualization leads to some interpretable representations, but they may be misleading or uninformative
- Assessing interpretability is difficult
- Requires user studies to show usefulness
- Neural networks learn distributed representation
- no one node represents a particular feature
- makes interpretation difficult
Steps to obtaining Gradient of Activation with respect to input
- Pick a neuron
- Run forward method up to layer we care about
- Find gradient of its activation w.r.t input image
- Can first find highest activated image patches using its corresponding neuron (based on receptive field)
T/F: A single-pixel change can make a NN wrong
True (single-pixel attacks)
Shape vs. Texture Bias
- Ex: take picture of cat and apply texture of elephant
- Humans are biased towards shape (will see cat)
- Neural Networks are biased towards texture (will classify cat as elephant, likely)
Estimation Error
Even the weights that best minimize training error may not generalize to the test set (i.e., overfitting or non-generalizable features in training)
Limitations to Transfer Learning
- If source dataset you train on is very different from target dataset
- If you have enough data for the target domain, it just results in faster convergence
____ can be used to detect dataset bias
Gradient-based visualizations
Saliency Maps
- Shows us what we think the neural network may find important in the input
- sensitivity of loss to individual pixel changes
- large sensitivity implies important pixels
What is non-semantic shift for label data?
Two images of the same category, but from a different domain or style
Ex: Two pictures of a bird, one a photograph and one a sketch

T/F: CNNs have scale invariance
True - but only some
low-labeled setting: domain generalization
- Source
- multiple labeled
- target
- unknown
- shift
- non-semantic
T/F: For larger networks, estimation error can increase
True - With a small amount of data and a large amount of parameters, we could overfit
Backward Pass: Deconvnet
- Pass back only the positive gradients
AlexNet - Key aspects
- ReLU instead of sigmoid/tanh
- Specialized normalization layers
- PCA-based data augmentation
- Dropout
- Ensembling
Gram Matrix
- Take a pair of channels in a feature map of n layers
- Get correlation (dot product) between features and then sum it up
- Feed into larger matrix (Gram) to get correlation of all features
- Get Gram matrix loss for style image with respect to generated image
- Get content loss (difference between features) for content image with respect to generated image
- Sum up the losses with parameters (alpha, beta) for proportion of total loss contributed by each loss
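A minimal sketch of the Gram matrix idea, assuming each channel of a feature map has already been flattened to a list of activations. The names `gram_matrix` and `gram_loss` are illustrative, and the squared-difference loss is one common choice rather than the only one.

```python
def gram_matrix(features):
    """features: list of C channels, each a flat list of N spatial activations."""
    c = len(features)
    # G[i][j] is the dot product of channel i and channel j over all positions,
    # so spatial layout is summed away and only co-activation statistics remain.
    return [[sum(a * b for a, b in zip(features[i], features[j]))
             for j in range(c)] for i in range(c)]


def gram_loss(feat_a, feat_b):
    """Sum of squared differences between two Gram matrices."""
    ga, gb = gram_matrix(feat_a), gram_matrix(feat_b)
    return sum((x - y) ** 2 for ra, rb in zip(ga, gb) for x, y in zip(ra, rb))


style = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0]]
# Same activations with the spatial positions reversed in every channel.
shuffled = [ch[::-1] for ch in style]
print(gram_matrix(style))
print(gram_loss(style, shuffled))
```

Reversing the spatial positions leaves the Gram matrix unchanged, which is why it works as a texture statistic that ignores layout.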

Low-labeled setting: Semi-supervised learning
- Source
- single labeled (usually much less)
- target
- single unlabeled
- shift
- none
low-labeled setting: cross-category transfer
- Source
- single labeled
- target
- single unlabeled
- shift
- semantic
T/F: We can generate images from scratch using gradients to obtain an image with maximized score for a given class?
True - Image optimization
Creating alternating layers in a CNN (convolution/non-linear, pooling, and fully connected layers at the end) results in a ________ receptive field.
It results in an increasing receptive field for a particular pixel deep inside the network.
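The growth can be sketched with the usual recurrence for receptive field size, assuming each layer is described only by a (kernel size, stride) pair and ignoring padding and dilation. The function name and the example stack are illustrative.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, in input-to-output order."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# conv 3x3, conv 3x3, 2x2 max pool with stride 2, conv 3x3
stack = [(3, 1), (3, 1), (2, 2), (3, 1)]
sizes = [receptive_field(stack[:i]) for i in range(1, len(stack) + 1)]
print(sizes)
```

Each added layer widens the receptive field, and layers after a strided pooling step widen it faster because `jump` has grown.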
What is the problem for visualization in modern Neural Networks?
Small filters such as 3x3
Small convolution outputs are hard to interpret
Increasing the depth of a NN leads to ___ error (higher/lower)
higher - hard to optimize (but can be mitigated with residual blocks/skip connections)
Since the output of convolution and pooling layers are ______ we can __________ them
Since the output of convolution and pooling layers are (multi-channel) images we can sequence them just as any other layer
What is semantic shift for labeled images?
Both are images, but of different things (different categories)

Most parameters in the ___ layer of a CNN
Fully Connected Layer - input x output dimensionality + bias
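A quick parameter-count sketch for comparison. The layer sizes below are illustrative (chosen to resemble a VGG-style network) and are not taken from the flashcards.

```python
def fc_params(n_in, n_out):
    # One weight per (input, output) pair plus one bias per output node.
    return n_in * n_out + n_out


def conv_params(k_h, k_w, c_in, c_out):
    # Each output channel owns one k_h x k_w x c_in kernel plus one bias.
    return (k_h * k_w * c_in + 1) * c_out


fc = fc_params(7 * 7 * 512, 4096)   # flattened 7x7x512 feature map -> 4096 nodes
conv = conv_params(3, 3, 512, 512)  # 3x3 conv, 512 -> 512 channels
print(fc, conv)
```

With these assumed sizes, the single fully connected layer holds roughly 40 times more parameters than the convolution layer.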
Normal backpropagation is not always the best choice for gradient-based visualizations because…?
- You may get parts of image that decrease the feature activation
- likely lots of these input pixels
Grad-CAM
- Feed image through CNN (only convolution part) for last Convolution Feature Map (most abstract features closest to classification on the network).
- Following CNN with any Task-specific network (classification, question/answering)
- Backprop until convolution
- Obtain a feature map the size of the original feature maps
- Obtain per-channel weighting (global average pooling for each channel of gradient) for neuron importance, then normalize
- Multiply feature maps with their weighting
- Feed through ReLU to obtain only positive features
- Final result, values that are important will have higher values
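The combination step can be sketched as follows, assuming the last convolution feature maps and the gradients of the class score with respect to them are already available as nested lists. Obtaining them requires a real network and is not shown. Scaling the final map to [0, 1] is a display choice made here.

```python
def grad_cam(feature_maps, gradients):
    """feature_maps, gradients: C channels, each an H x W nested list."""
    # Per-channel weight: global average pool of that channel's gradient.
    weights = [sum(map(sum, g)) / (len(g) * len(g[0])) for g in gradients]
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for weight, fmap in zip(weights, feature_maps):
        for i in range(h):
            for j in range(w):
                cam[i][j] += weight * fmap[i][j]
    # ReLU keeps only locations with a positive influence on the class.
    cam = [[max(0.0, v) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    return [[v / peak for v in row] for row in cam] if peak > 0 else cam


feature_maps = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(feature_maps, gradients)
print(heatmap)
```

The second channel has a negative weight, so its strong activation in the bottom-right corner is removed by the ReLU.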

VGG - Key Aspects
- Repeating particular blocks of layers
- 3x3 conv with small strides
- 2x2 max pooling stride 2
- Very large number of parameters
Convolution layers have the property of _____ and output has the property of _______
(choose translation equivariance or invariance for each)
Convolution layers have the property of translation equivariance and output has the property of invariance
Note: Some rotation invariance and scale invariance (only some)
Visualizing Neural Network Methods
- Weights (kernels)
- See what edges are detected in kernels
- Activations
- What does image look like in activation layer
- Gradients
- Assess what is used for the optimization itself
- Robustness
- See what weaknesses/bias are of NN
The gradient of the Convolution layer Kernel is equivalent to the _________
Cross-Correlation between the upstream gradient and input (until K1xK2 output)
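A small numerical check of this statement, assuming a single-channel "valid" cross-correlation with stride 1 and a toy loss L = sum(upstream * y). All values are made up for illustration.

```python
def cross_correlate(image, kernel):
    """Valid 2-D cross-correlation (the operation deep learning calls convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]


x = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 1.0, 1.0]]
k = [[1.0, -1.0], [0.5, 2.0]]
upstream = [[1.0, 0.0], [2.0, -1.0]]  # dL/dy for the toy loss


def loss(kernel):
    y = cross_correlate(x, kernel)
    return sum(u * v for ur, yr in zip(upstream, y) for u, v in zip(ur, yr))


# Analytic kernel gradient: slide the upstream gradient over the input.
grad_k = cross_correlate(x, upstream)

# Finite-difference estimate of the same gradient, one kernel entry at a time.
eps = 1e-6
numeric = []
for a in range(2):
    row = []
    for b in range(2):
        bumped = [r[:] for r in k]
        bumped[a][b] += eps
        row.append((loss(bumped) - loss(k)) / eps)
    numeric.append(row)
print(grad_k, numeric)
```

The result has the same K1 x K2 shape as the kernel, matching the "until K1xK2 output" note.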

Defenses for adversarial attacks
- training with adversarial examples
- perturbations, noise, or re-encoding of inputs
- there are no universal methods to prevent attacks
T/F: Computer vision segmentation algorithms can be applied directly to gradients to get image segments
True

Exploring the space of possible architecture (methods)
- Evolutionary Learning and Reinforcement Learning
- Prune over-parameterized networks
- Learning of repeated blocks is typical
The gradient of the loss with respect to the input image is equivalent to ____
Convolution between the upstream gradient and the kernel

Backward Pass:
Guided Backpropagation
- Zero out gradient for negative values in forward pass
- Zero out negative gradients
- Only propagate positive influence
- Like a combination of backprop and deconvnet
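The three backward rules through a ReLU can be compared side by side. This is a sketch on flat lists; `relu_backward` and its `mode` argument are illustrative names, not a library API.

```python
def relu_backward(forward_input, upstream_grad, mode="backprop"):
    """Gradient passed back through a ReLU under three visualization rules."""
    out = []
    for x, g in zip(forward_input, upstream_grad):
        if mode == "backprop":     # gate on the forward pass only
            keep = x > 0
        elif mode == "deconvnet":  # gate on the gradient sign only
            keep = g > 0
        elif mode == "guided":     # both gates together
            keep = x > 0 and g > 0
        else:
            raise ValueError(f"unknown mode: {mode}")
        out.append(g if keep else 0.0)
    return out


x = [1.0, -1.0, 2.0, -3.0]  # inputs to the ReLU in the forward pass
g = [0.5, 0.7, -0.2, -0.4]  # gradient arriving from the layer above
for mode in ("backprop", "deconvnet", "guided"):
    print(mode, relu_backward(x, g, mode))
```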
Gradient Ascent
- Compute the gradient of the score for a particular class with respect to the input image
- Add the learning rate times gradient to maximize score (not subtracting)
- Algorithm
- Start from random/zero image
- Compute forward pass
- Compute gradients
- Perform Ascent
- Iterate
- Note: Uses scores to avoid minimizing other class scores
- Need regularization as well
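A toy version of the loop, assuming a linear scorer (score = weight row dot image) stands in for the network so that the gradient is available in closed form. The L2 penalty, step count, and learning rate are illustrative choices.

```python
def class_score(weights, image, cls):
    return sum(w * p for w, p in zip(weights[cls], image))


def ascend(weights, cls, steps=100, lr=0.1, l2=0.5):
    image = [0.0] * len(weights[cls])  # start from a zero image
    for _ in range(steps):
        # For a linear scorer, d(score)/d(image) is just the class weight row.
        grad = weights[cls]
        # Ascent step on score - l2 * ||image||^2: add the gradient, do not subtract.
        image = [p + lr * (g - 2 * l2 * p) for p, g in zip(image, grad)]
    return image


weights = [[1.0, -2.0, 0.5], [-1.0, 1.0, 2.0]]  # two classes, three "pixels"
result = ascend(weights, cls=1)
print(result, class_score(weights, result, 1))
```

Without the L2 term the image would grow without bound for this scorer, which is one reason regularization is needed.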

How do we represent similarity in terms of textures?
- Should remove most spatial information
- Key ideas revolved around summary statistics
- Gram Matrix
- feature correlations
We can take the activations of any layer (FC, conv, etc.) and perform _____________
- dimensionality reduction
- often to reduce to two dimensions for plotting
- PCA
- t-SNE (most common)
- non-linear mapping to preserve pair-wise distances
- good for visualizing decision boundaries (esp non-linear)
What is the power-law region for data effectiveness?
Region where generalization error (log-scale) decreases linearly with sufficient data

Modeling Error
Given a NN architecture, actual model that represents the real world may not be in that space. There may be no set of weights that model the real world.
I.e., a simple architecture or function may not be able to model complex reality (potentially low capacity)
What can you do to train a CNN if you don’t have enough data?
Transfer Learning -
- Train on large-scale dataset and optimize parameters
- Take custom data set and initialize the network with weights trained before (step 1)
- Replace last layer with new fully-connected layer for output nodes per category
- Continue to train on new dataset (finetune - update parameters, freeze feature layer - update only last layer weights if not enough data)
low-labeled setting: few-shot learning
- Source
- single labeled
- target
- single few-labeled
- shift
- semantic
Most memory usage is in the ___ layers of a CNN
convolution layers - large output
Residual block/ skip connections
Allow information from a layer to propagate to any future layer (via identity, i.e., no transform)
can help with better gradient flow

low-labeled setting: domain adaptation
- Source
- single labeled
- target
- single unlabeled
- shift
- non-semantic
T/F: Saliency maps use the loss to assess importance of input pixels
False
- In practice, saliency maps find gradient of the classifier scores (pre-softmax)
- softmax and then loss function adds some complexity (weird effects in terms of the gradient)
How to preserve the content of an image
- Match features at different layers
- Use a loss for this
- optimize image by minimizing the difference between the images (content and generated images)
- Multiple losses
- Backward edges going to same node are summed
- Loss is sum of the difference across the identified layers

Optimization Error
Optimization algorithm may not be able to find the weights that 100% model the world
T/F: We have reached the point in complex CNN architectures where more data is not/barely improving performance
False - The ‘Irreducible Error Region’ has not been reached
What does an input pixel affect at the output in convolution?
Neighborhood around it (where part of the kernel touches it)
Visualizing Weights for CNN Layers
- Fully Connected Layers
- Reshape weights for a node back into size of image, then scale to 0-255
- Convolution Layers
- For each kernel, scale values from 0-255 and observe:
- oriented edges
- color
- texture
Receptive Field
Defines what set of input pixels in the original image affect the value of a particular node deep in the neural network.
Where does a kernel pixel affect an output image during the convolution operation?
Everywhere!
The pixels in the kernel stride across the entire input image
low-labeled setting: un/self-supervised
- Source
- single unlabeled
- target
- many labeled
- shift
- both/task
For larger networks, optimization error will likely ___ in size
increase - dynamics of optimization could get more difficult with a deeper network
AlexNet - Architecture
Horizontal split architecture - couldn’t fit into one GPU
conv -> max pool -> norm (x2)
conv x 3 -> max pool
fully connected x3
T/F: CNNs do not have rotation invariance
False - They have some
A way to increase class scores or activations for an image
Gradient Ascent - optimization of an image to increase score for a particular class
Effectiveness of Transfer Learning
Surprisingly effective
Features learned for 1000 object categories will work well for the 1001st!
Generalizes even across tasks (classification to object detection)
For larger networks, modeling error will ___ in size
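likely decrease in size - a larger network has more capacity, so it is more likely to contain a function that models reality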
What was used to show the benefits of Neural Networks?
Large-scale data benchmarking
Inception Architecture
- Repeated blocks composed of simple layers
- parallel filters of different sizes
- 1x1 convolution, 3x3 convolution, 5x5 convolution, 3x3 max pooling -> filter concatenation
- increases computational complexity (4 times)
T/F: You need a large amount of pixel changes to make a network confidently wrong
False - Gradient ascent perturbations can make model confidently wrong (adversarial noise)
Key elements of practical application of saliency maps
- Find gradient of classifier scores (pre soft-max), instead of loss
- take absolute value of gradients
- sum across channels
- We don’t care specifically about RGB specifics
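The post-processing steps above can be sketched as follows, assuming the gradient of a pre-softmax class score with respect to the image is already available as three H x W channels. Computing that gradient needs a real network and is not shown.

```python
def saliency_map(score_grad):
    """score_grad: channels (R, G, B), each an H x W nested list of gradients."""
    channels = len(score_grad)
    h, w = len(score_grad[0]), len(score_grad[0][0])
    # Absolute value first, then sum over channels: one importance value per pixel.
    return [[sum(abs(score_grad[c][i][j]) for c in range(channels))
             for j in range(w)] for i in range(h)]


# A made-up 1 x 2 image gradient with three colour channels.
grad = [[[0.5, -0.25]], [[-1.0, 0.0]], [[0.25, 0.5]]]
print(saliency_map(grad))
```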
Visualizing Output Maps
- Visualization of activation/filter
- Larger early in the network
- Looking at activations across the input
- which images have the highest activation?
Computing the gradient of the loss with respect to the inputs for Convolution

Semantic Segmentation

Object Detection

Instance Segmentation

T/F: Fully connected layers explicitly retain spatial information
False
Converting Fully Connected Layers to Convolution Layers
- Each kernel has size of entire input
- Equivalent to Wx+b
- output is one scalar
- One kernel per output node
Resulting output for Image Segmentation Networks
Probability distribution over classes for each pixel.

Convolutions work on ____ input sizes
Convolutions work on arbitrary input sizes (because of striding)
Max Unpooling

In max-unpooling/deconvolution, contributions from multiple windows are ____
In max-unpooling, contributions from multiple windows are summed.

Deconvolution (“transposed convolution”)
Take each input pixel, multiply by learnable kernel, “stamp” it on output
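The "stamping" view can be written directly. This is a single-channel sketch without padding; the kernel would be learned in practice but is fixed here for illustration.

```python
def transposed_conv2d(x, kernel, stride=2):
    kh, kw = len(kernel), len(kernel[0])
    oh = (len(x) - 1) * stride + kh
    ow = (len(x[0]) - 1) * stride + kw
    out = [[0.0] * ow for _ in range(oh)]
    for i, row in enumerate(x):
        for j, value in enumerate(row):
            # Stamp a copy of the kernel scaled by this input pixel.
            for a in range(kh):
                for b in range(kw):
                    # Overlapping stamps are summed.
                    out[i * stride + a][j * stride + b] += value * kernel[a][b]
    return out


x = [[1.0, 2.0]]
kernel = [[1.0, 1.0, 1.0]]
y = transposed_conv2d(x, kernel, stride=2)
print(y)
```

The middle output position receives contributions from both input pixels because the two stamps overlap there.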
Transfer Learning
Begin with a pre-trained trunk/backbone (e.g. network pretrained on ImageNet)
For encoder/decoder connections, you can ___ to bypass bottlenecks
skip connections
Object Detection
Given an image, output a list of bounding boxes with probability distribution over classes per box
What are the key problems to address with object detection?
Variable number of boxes
Need to determine candidate regions (position and scale) first
Architecture for Object Detection
- multi-headed
- classification
- predicting distribution over class labels
- regression
- predicting bounding box for each image region
- both heads share features
- jointly optimized (summing gradients)
Non-Maximal suppression (NMS)
Combining redundant boxes to find bounding box for object in image
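A minimal sketch of the common greedy form of NMS, which keeps the highest-scoring box and discards others that overlap it too much. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the 0.5 threshold is an illustrative default.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```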
Single-Shot Detector (SSD)
- uses grid idea as anchors
- different scales
- different aspect ratios
- tricks used to increase resolution (decrease subsampling ratio)

You Only Look Once (YOLO)
Single-scale
faster for same size than SSD
COCO Dataset
large-scale object detection, segmentation, and captioning dataset
Evaluation of bounding box for image threshold (steps)
- For each bounding box, calculate intersection over union (IoU)
- extract intersection over union with closest ground truth
- Keep only those with IoU > threshold
- Calculate Precision/Recall curve across classification probability threshold
- Calculate average precision (AP) over recall of [0, 0.1, 0.2, …, 1.0]
- Average over all categories to get mean Average Precision (mAP)
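The AP step can be sketched for one category, assuming each detection has already been marked as a true or false positive by the IoU threshold. This uses 11-point interpolated precision; the function and argument names are illustrative.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    ranked = sorted(zip(scores, is_true_positive), key=lambda p: p[0], reverse=True)
    tp = fp = 0
    curve = []  # (recall, precision) after each detection, by descending score
    for _, hit in ranked:
        tp += hit
        fp += not hit
        curve.append((tp / num_ground_truth, tp / (tp + fp)))
    total = 0.0
    for step in range(11):  # recall levels 0, 0.1, ..., 1.0
        level = step / 10
        # Interpolated precision: best precision at any recall >= level.
        total += max((p for r, p in curve if r >= level), default=0.0)
    return total / 11


ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False], 2)
print(round(ap, 4))
```

mAP is then the mean of this value over all categories.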

R-CNN
- Find regions of interest (ROIs) with object-like things
- Classify those regions (refine their bounding boxes)
Method to extract region of interest in an image
- unsupervised (non-learned) algorithms
- downsides
- 1+ second per image
- returns thousands of mostly background images
- resize each candidate to full input size and classify
Downside of R-CNN
- Takes 1+ second per image
- return thousands of (mostly background) boxes
Inefficiency of R-CNN
Computations for convolutions are re-done for each image patch, even if overlapping
Fast R-CNN difference
- Reuse computation by finding regions in feature maps
- feature extraction once per image
Problem with R-CNN
- Variable input size to FC layers due to different feature map sizes
R-CNN fix for differing feature map sizes
- ROI Pooling
- Given an arbitrarily sized feature map, we can use pooling across a grid (ROI Pooling Layer) to convert to fixed-sized representation
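A single-channel sketch of ROI max pooling. It assumes the ROI is given in feature-map cells with end-exclusive bounds and is at least as large as the output grid. The integer bin edges are a simplification chosen here.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """roi = (row0, col0, row1, col1), end-exclusive, in feature-map cells."""
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    if h < out_h or w < out_w:
        raise ValueError("ROI smaller than the output grid")
    pooled = []
    for i in range(out_h):
        top, bottom = r0 + i * h // out_h, r0 + (i + 1) * h // out_h
        row = []
        for j in range(out_w):
            left, right = c0 + j * w // out_w, c0 + (j + 1) * w // out_w
            # Max over every cell that falls inside this grid bin.
            row.append(max(feature_map[r][c]
                           for r in range(top, bottom)
                           for c in range(left, right)))
        pooled.append(row)
    return pooled


fmap = [[r * 5 + c for c in range(5)] for r in range(4)]  # 4 x 5, values 0..19
small = roi_max_pool(fmap, (0, 0, 2, 2))  # 2 x 2 region
large = roi_max_pool(fmap, (0, 0, 4, 5))  # the whole 4 x 5 map
print(small, large)
```

Both regions produce a 2 x 2 output, so the fully connected layers that follow always see the same input size.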
Faster R-CNN key difference
- Use Neural Networks for the region proposal
- Region Proposal Network (RPN)
- output: objectness score
- top k selected for classification
- complexity in implementation due to some non differentiable parts (gradient with respect to bounding box coordinates)
Region Proposal Network (RPN)
- Neural Network model to find regions of objects
- Uses anchors in a grid
- k anchor boxes
- various sizes and shapes
- hyperparameters
- 2k scores
- object or not-object like
- 4k coordinates
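Anchor generation on a grid can be sketched as below. The scales, aspect ratios, and stride are hypothetical hyperparameters picked for illustration, and the box format (x1, y1, x2, y2) is an assumption.

```python
import math


def make_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return (x1, y1, x2, y2) anchors; k = len(scales) * len(ratios) per cell."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Anchor centre in input-image pixels for this feature-map cell.
            cy, cx = (i + 0.5) * stride, (j + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # ratio = width / height
                    w = scale * math.sqrt(ratio)
                    h = scale / math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(2, 3, stride=16, scales=[32, 64], ratios=[0.5, 1.0, 2.0])
k = 2 * 3  # anchors per grid location
print(len(anchors), 2 * k, 4 * k)
```

With k = 6 here, the RPN head would predict 12 object/not-object scores and 24 box coordinates at each grid location.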
Two-stage object detection methods are ___ compared to single-stage methods (YOLO/SSD)
Two-stage object detection methods are slower but more accurate