Quiz 3 Flashcards
Optimization error
Even if your neural network can perfectly model the real world, your optimization algorithm may not be able to find the set of weights that represents that function.
Estimation error
Even if optimization finds the best hypothesis (the best set of weights or parameters for the network), that doesn't mean the model will generalize to the test set, because it was trained on finite data.
Modeling error
Given a particular neural network architecture, the true function that describes the real world may not lie in that hypothesis space.
Optimization error scenario
You were lazy and couldn’t/didn’t optimize to completion
Estimation error scenario
You tried to learn the model with finite data
Modeling error scenario
You approximated reality with a model
More complex models lead to ______ modeling error
Smaller
Transfer learning steps
1) Train on large-scale dataset
2) Take your custom data and initialize the network with weights trained in Step 1
3) Continue training on new dataset
Ways to apply transfer learning
Finetune, freeze
Finetune
Update all parameters
Freeze (feature layer)
Update only the last layer's weights (used when you don't have enough data); see the sketch below
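A minimal PyTorch sketch of both modes, assuming torchvision's ImageNet-pretrained ResNet-18 and a placeholder 10-class target task:

```python
import torch.nn as nn
from torchvision import models

# Step 1 is already done: torchvision ships ImageNet-pretrained weights (Step 2 loads them)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze: stop gradients for every pretrained layer, then replace and
# train only the new last layer (Step 3 trains just model.fc)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)   # 10 = placeholder class count

# Finetune instead: skip the freeze loop and continue training all
# parameters on the new dataset, typically with a smaller learning rate
```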
When is transfer learning less effective?
If the source dataset you train on is very different from the target dataset
As we add more data, what do we see in generalization error?
Error continues to get smaller / accuracy continues to improve
LeNet architecture
two sets of convolutional, activation, and pooling layers, followed by a fully-connected layer, activation, another fully-connected, and finally a softmax classifier
LeNet architecture shorthand
INPUT => CONV => RELU => POOL => CONV => RELU => POOL => FC => RELU => FC
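A minimal PyTorch sketch of that shorthand for 28x28 MNIST inputs, assuming the classic LeNet-5 channel sizes (with ReLU and max pooling standing in for the original tanh and average pooling):

```python
import torch.nn as nn

# INPUT => CONV => RELU => POOL => CONV => RELU => POOL => FC => RELU => FC
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2),   # CONV: 1x28x28 -> 6x28x28
    nn.ReLU(),
    nn.MaxPool2d(2),                             # POOL: -> 6x14x14
    nn.Conv2d(6, 16, kernel_size=5),             # CONV: -> 16x10x10
    nn.ReLU(),
    nn.MaxPool2d(2),                             # POOL: -> 16x5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),                  # FC
    nn.ReLU(),
    nn.Linear(120, 10),                          # FC: class scores (softmax lives in the loss)
)
```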
LeNet architecture good for
Number classification / MNIST dataset
AlexNet architecture
Eight layers with learnable parameters: five convolutional layers (some followed by max pooling) and three fully connected layers, with ReLU activations in every layer except the output.
AlexNet activation function
ReLU instead of sigmoid or tanh (the first major architecture to do so)
AlexNet data augmentation
PCA-based (principal component analysis)
AlexNet regularization
Dropout (the first network to use it)
AlexNet used ______ to combine predictions from multiple networks
Ensembling
VGGNet used ____ modules / blocks of layers
repeated
All convolutions in VGG are ___
3 x 3
Most memory usage in VGG is in
Convolution layers
Most of the parameters in VGG are
In the fully connected layers
VGG max pooling
2x2, stride 2
Number of parameters in VGG
About 138 million (VGG-16)
Number of parameters in AlexNet
60 million
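A back-of-the-envelope check of the two cards above: the first fully connected layer of VGG-16 alone accounts for roughly 100 million of its ~138 million parameters.

```python
# First fully connected layer of VGG-16: a 7x7x512 feature map flattened into 4096 units
fc1_params = 7 * 7 * 512 * 4096     # 102,760,448 weights, ~75% of the network
# A large 3x3 conv layer (512 -> 512 channels) for comparison:
conv_params = 3 * 3 * 512 * 512     # 2,359,296 weights
print(f"{fc1_params:,} vs {conv_params:,}")
```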
Key idea behind Inception
Repeated blocks and multi-scale features
Inception uses _____ filters
parallel (multiple filter sizes applied side by side)
Residual neural networks are easy to ___
optimize
Equivariance
If the input undergoes a transformation, the output undergoes the same transformation: if x → a, then T(x) → T(a).
Invariance
If the input undergoes a transformation, the output is unchanged: if x → a, then T(x) → a.
We want equivariance in ________ layers
intermediate / convolutional
We want invariance in _____ layers
output / final layers (we often also want invariance to rotation)
Convolution is ______ Equivariant
Translation
Max pooling is invariant to ________
Permutation
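A small PyTorch sketch of both properties; circular padding is assumed so the equivariance check is exact (with zero padding it holds only away from the borders):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 1, 16, 16)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode="circular")

# Equivariance: convolving a shifted input == shifting the convolved output
a = conv(torch.roll(x, shifts=(2, 3), dims=(2, 3)))
b = torch.roll(conv(x), shifts=(2, 3), dims=(2, 3))
print(torch.allclose(a, b, atol=1e-6))   # True

# Invariance: the max over a pooling window ignores permutations within it
window = torch.tensor([1.0, 5.0, 2.0, 3.0])
print(window.max() == window[torch.randperm(4)].max())   # tensor(True)
```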
Style transfer
1) Take the content image and compute its features
2) Take the generated image (initialized to zeros or random noise) and compute its features
3) Take the style image and compute its features
4) Update the generated image to minimize both the content and style losses at the same time
Gram matrix
Represents correlations between pairs of feature-map channels within a layer of the neural network (computed at several layers)
How Gram matrix works
1) Take a particular layer in a CNN
2) Take a pair of channels within the feature map
3) Compute the correlation, or dot product, between the two feature maps
4) Repeat for every pair of channels; the results form the entries of the Gram matrix
Loss function in style transfer
Minimize the squared difference between the Gram matrix of the style image and the Gram matrix of the generated image (the style loss), plus the difference between the features of the content image and the generated image (the content loss). The total loss is a weighted sum of the two losses.
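A minimal sketch of the Gram matrix and the weighted total loss; the feature tensors and the weights alpha and beta are placeholders:

```python
import torch

def gram(features):
    # features: (C, H, W) feature map taken from one CNN layer
    c, h, w = features.shape
    f = features.reshape(c, h * w)   # one row per channel
    return f @ f.T / (c * h * w)     # pairwise channel dot products (one normalization convention)

# Placeholder feature maps for the content, style, and generated images
f_content, f_style, f_gen = (torch.randn(64, 32, 32) for _ in range(3))

content_loss = ((f_gen - f_content) ** 2).mean()          # feature (content) difference
style_loss = ((gram(f_gen) - gram(f_style)) ** 2).sum()   # Gram matrix (style) difference
alpha, beta = 1.0, 1e3                                    # placeholder loss weights
total_loss = alpha * content_loss + beta * style_loss
```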
Model with well-calibrated predictions
Logistic regression
Model with poorly calibrated predictions (overconfident)
ResNet
Group calibration
The scores for subgroups of interest are calibrated or equally miscalibrated
A classifier is well-calibrated if
Among the observations assigned a given probability score, the proportion that actually has the label equals that score.
Platt scaling requires
An additional validation dataset
Platt scaling
Learn parameters a and b so that the calibrated probability is sigmoid(az + b), where z is the model's uncalibrated output score (logit)
Difference between Platt scaling and temperature scaling
Temperature scaling applies Platt scaling to multi-class classification using softmax
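A minimal temperature scaling sketch, assuming held-out validation logits and labels (the standard recipe fits T with L-BFGS; plain Adam is used here for brevity):

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Learn one scalar T on held-out validation logits/labels by minimizing NLL."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# Usage (val_logits, val_labels, test_logits are placeholders):
# T = fit_temperature(val_logits, val_labels)
# probs = torch.softmax(test_logits / T, dim=1)   # calibrated probabilities
```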
Limitations of calibration
It is group-based (what characteristic denotes the groups?), and there are inherent tradeoffs in calibration (the fairness impossibility theorems)
The Fairness Impossibility Theorems
It is impossible for a classifier to achieve both equal calibration and error rates between groups, if there is a difference in prevalence between the groups and the classifier is not perfect
Positive Predictive Value
PPV = TP/(TP + FP)
What does an impossibility theorem obtain?
For any three (or more) measures of model performance derived from the confusion matrix, the resulting system of equations determines the prevalence p uniquely; if groups have different prevalences, these quantities cannot all be equal across groups (unless the classifier is perfect)
Transposed convolution
Take each input pixel, multiply it by a learnable kernel, and "stamp" the result onto the output (summing where stamps overlap)
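A minimal NumPy sketch of the "stamp" view (stride 2, no padding); overlapping stamps are summed:

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Stamp kernel * x[i, j] onto the output; overlapping stamps are summed."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            out[i*stride:i*stride + k, j*stride:j*stride + k] += x[i, j] * kernel
    return out

x = np.arange(4.0).reshape(2, 2)
print(transposed_conv2d(x, np.ones((3, 3))))   # 2x2 input upsampled to 5x5 output
```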
Large sensitivity of loss implies what
Important pixels
What do you have to find in saliency maps?
The gradient of the classifier scores (pre-softmax) with respect to the input image. Take the absolute value of the gradient and sum across all channels.
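A minimal PyTorch sketch, assuming a pretrained ResNet-18 and a placeholder input image:

```python
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
image = torch.randn(1, 3, 224, 224, requires_grad=True)   # placeholder input image

scores = model(image)                    # pre-softmax class scores
scores[0, scores.argmax()].backward()    # gradient of the top class score w.r.t. the image

# |gradient|, summed over the 3 color channels -> one saliency value per pixel
saliency = image.grad.abs().sum(dim=1).squeeze()   # shape (224, 224)
```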
What gets zeroed out in guided backprop?
Negative values: the forward pass zeroes out negative activations (ReLU), and the backward pass additionally zeroes out negative gradients (we only pass back positive gradients)
Gradient ascent
Compute the gradient of the score for the class we care about with respect to the input image. Rather than subtracting the learning rate times the gradient (as in gradient descent), we add the learning rate times the gradient.
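A minimal sketch, again assuming a pretrained ResNet-18; the target class, learning rate, and regularization strength are placeholders:

```python
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
image = torch.zeros(1, 3, 224, 224, requires_grad=True)   # start from a zero image
target_class, lr, reg = 130, 1.0, 1e-4                    # placeholder values

for _ in range(100):
    score = model(image)[0, target_class]   # pre-softmax score for the target class
    score.backward()
    with torch.no_grad():
        # ADD the gradient step (ascent) and lightly regularize the image
        image += lr * image.grad - reg * image
        image.grad.zero_()
```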
Defenses against adversarial attacks
Training with adversarial examples, adding perturbations or noise, or re-encoding inputs to remove the attack
Cross entropy
Easy examples each incur a non-negligible loss, which in aggregate masks out the harder, rare examples
Focal loss
Down weights easy examples to give more attention to difficult examples
Focal loss formula
FL(p_t) = -(1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class and γ ≥ 0 is the focusing parameter
Focal loss is used to
Address the issue of the class imbalance problem
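A minimal sketch of the formula above, comparing an easy and a hard example; the down-weighting factor (1 - p_t)^γ shrinks the easy example's loss far more than the hard one's:

```python
import torch

def focal_loss(p_t, gamma=2.0):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t), p_t = predicted prob. of the true class."""
    return -((1 - p_t) ** gamma) * torch.log(p_t)

easy, hard = torch.tensor(0.95), torch.tensor(0.3)
print(focal_loss(easy).item())   # ~0.0001: an easy example is strongly down-weighted
print(focal_loss(hard).item())   # ~0.59:  a hard example keeps most of its loss
```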
Receptive field defines
The set of input pixels in the original image that affect the value of a given node or activation deep inside the network
As you get deeper into the neural network, the receptive field
continues to grow, with each successive layer expanding it further
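A small sketch of the standard receptive-field recurrence (r grows by (k - 1) times the cumulative stride at each layer); the layer lists are illustrative:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input to output."""
    r, j = 1, 1                 # receptive field size and cumulative stride ("jump")
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r

print(receptive_field([(3, 1), (3, 1), (3, 1)]))           # 7: three 3x3 convs see 7x7
print(receptive_field([(3, 1), (2, 2), (3, 1), (3, 1)]))   # 12: a stride-2 pool speeds growth
```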