Quiz 2 - Optimization, CNN Flashcards
Loss surface geometries difficult for optimization
- local minima
- plateaus
- saddle points
- gradients of orthogonal directions are zero
- (min for one direction, max for another)
tanh function
- min: -1, max: 1
- centered
- saturates at both ends
- gradients
- vanish at both ends
- computationally heavy
parameter sharing
regularize parameters to be close together by forcing sets of parameters to be equal
Normalization is helpful as it can
- improve gradient flow
- improve learning
Color jitter
- Used for data augmentation
- add/subtract a small or large value to RGB channels in an image
Data Augmentation
- Perform a range of transformations to a dataset
- increases data for free
- should not change meaning of data
- ex: flip image, black/white, crop
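A minimal sketch of such a pipeline using torchvision transforms (the specific classes and parameter values are illustrative, not from the card):

    import torchvision.transforms as T

    # Hypothetical augmentation pipeline; labels are unchanged since the
    # transformations should not change the meaning of the data.
    train_transform = T.Compose([
        T.RandomHorizontalFlip(p=0.5),                 # flip image
        T.RandomResizedCrop(224),                      # crop (and resize)
        T.ColorJitter(brightness=0.2, contrast=0.2,
                      saturation=0.2, hue=0.05),       # jitter RGB values
        T.RandomGrayscale(p=0.1),                      # black/white
        T.ToTensor(),
    ])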
The key principle for NN training
- Monitor everything to understand what is going on
- loss/accuracy curves
- gradient statistics/characteristics
- other aspects of computation graph
Sanity checks for learning after optimization
- Check bounds of loss function
- Check initial loss at small random weight values (-log(p) for CE)
- start without regularization, then add it and make sure the loss increases
- simplify dataset to make sure model can properly (over)fit before applying regularization
- to ensure that model capacity is enough
- model should be able to memorize
L2 regularization results in a solution that is ___ sparse than L1
L2 regularization results in a solution that is less sparse than L1
Why is initialization of parameters important
- determines how statistics of outputs (given inputs) behave
- determines if the gradients vanish at the beginning (dampening learning)
- ie. gradient flow
- allows bias at the start (linear)
- faster convergence
What suggests overfitting when looking at validation/training curve?
- validation loss/accuracy starts to get worse after a while

Shared Weights
- Advantage
- reduce params
- explicitly maintain spatial information
- Use same weights/params in computation graph
sigmoid function
- Gradient will be vanishingly small
- Partial derivative of loss wrt weights (used for gradient descent) will be a very small number (multiplied by a small upstream gradient)
- pass back the small gradients
- Forward pass high values
- causes larger and larger forward values
- Issues in both directions
- computationally heavy
ReLU
- min: 0, max: infinity
- outputs always positive
- no saturation on the positive end
- better gradient flow
- gradients
- 0 if x <= 0 (dead ReLU)
- other ReLUs can make up for this
- computationally cheap
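A quick way to check these output/gradient properties numerically, as a sketch with PyTorch autograd (the probe values are arbitrary):

    import torch

    x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0], requires_grad=True)
    for name, f in [("sigmoid", torch.sigmoid), ("tanh", torch.tanh), ("relu", torch.relu)]:
        y = f(x)
        y.sum().backward()
        # sigmoid/tanh gradients vanish at the extremes; ReLU gradient is 0 for x <= 0
        print(name, "outputs:", y.detach().numpy(), "grads:", x.grad.numpy())
        x.grad = None  # reset before the next activation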
Sigmoid is typically avoided unless ___
you want to clamp values from [0,1] (ie. logistic regression)
Simpler Xavier initialization (Xavier2)
N(0, 1) * sqrt(1 / n_j), where n_j is the fan-in
How to Prevent Co-Adapted Features
- Dropout Regularization
- Keep nodes with probability p
- nodes that are not kept have their activation set to 0
- Choose nodes to mask out at each iteration
- multiply the nodes by a [0 1] mask
- Note: no nodes are dropped during testing
- Scale Weights at test time by p so that input/outputs have similar distributions
- At test time, all nodes are active so need a way to account for this
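A minimal sketch of the scheme described above (mask during training, scale by p at test time); frameworks usually implement the equivalent "inverted" variant that scales by 1/p during training instead:

    import numpy as np

    def dropout_forward(h, p, train=True):
        # h: activations; p: probability of KEEPING a node
        if train:
            mask = (np.random.rand(*h.shape) < p).astype(h.dtype)  # new 0/1 mask each iteration
            return h * mask
        # test time: all nodes active, so scale by p to keep output statistics similar
        return h * p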
Fully Connected Neural Network
more and more abstract features from raw input
not well-suited for images
Why does dropout work
- Model should not rely too heavily on a particular feature
- Probability (1 - p) of losing the feature it relies on
- Equalizes the weights across all of the features
- Training 2^n neural networks
- n - number of nodes
- 2^n distinct variations of the mask
- ensemble effect
Pooling Layer
Layer to explicitly down-sample image or feature maps (dimensionality reduction)
What is the number of parameters for a conv layer with N kernels of size k1 x k2 and 3 input channels?
N * (k1 * k2 * 3 + 1)
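This count can be checked against PyTorch's own parameter count (a sketch; the sizes are arbitrary):

    import torch.nn as nn

    N, k1, k2 = 64, 3, 5                                   # N kernels of size k1 x k2
    conv = nn.Conv2d(in_channels=3, out_channels=N, kernel_size=(k1, k2))
    n_params = sum(p.numel() for p in conv.parameters())
    assert n_params == N * (k1 * k2 * 3 + 1)               # weights plus one bias per kernel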
L2 regularization
- L2 norm
- encourage small weights (but fewer zeros than L1)
Sigmoid Function Key facts
- min: 0, max: 1
- outputs are always positive
- saturates at both ends
- gradient
- vanishes at both ends
- always positive
Definition of accuracy with respect to TP, TN, FP, FN
(TP + TN) / (TP + TN + FP + FN)
Per-Parameter Learning Rate
- Dynamic learning rate for each weight
- Examples
- RMSProp
- Adagrad
- Adam
How can you mitigate the problem with Adam
- Time-varying bias correction
- beta1 = 0.9
- beta2 = 0.999

Difference between Convolution and Cross-Correlation
- Convolution: start at the end of the kernel and move backward
- Cross-correlation: start at the beginning of the kernel and move forward (same direction as the image)
- as if applying an already-flipped kernel
- dot product moving along the image
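A small numerical check of this relationship, as a sketch with scipy (convolution equals cross-correlation with a flipped kernel):

    import numpy as np
    from scipy.signal import convolve2d, correlate2d

    img = np.random.rand(5, 5)
    k = np.random.rand(3, 3)

    conv = convolve2d(img, k, mode="valid")
    xcorr_flipped = correlate2d(img, np.flip(k), mode="valid")  # kernel flipped in both dims
    assert np.allclose(conv, xcorr_flipped)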
T/F: The existence of local minima is the main issue in optimization
- False - Other aspects of the loss surface cause issues
- Noisy gradient estimates (ie. from mini-batches)
- Saddle points
- ill-conditioned loss surface
- curvature/gradients higher in some directions
Normalization as a layer (algorithm)
note: small epsilon used for numerical stability
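Since the algorithm itself is not written out on this card, here is a minimal sketch of the training-time batch-norm computation (per-feature statistics, with the learnable gamma/beta from the later card):

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        # x: (batch, features); normalize each feature over the mini-batch
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)  # small epsilon for numerical stability
        return gamma * x_hat + beta            # learnable scale and shift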

Each node in a NN for Convolution NN receives ___
- Input from a K2 x K1 window (image patch)
- region of input is called “receptive field”
- Advantage
- reduce parameters
- explicitly maintain spatial information
T/F: With dropout regularization, nodes are dropped during testing
False - All nodes are kept.
What does a tiny loss change suggest?
too small of a learning rate
Which non-linearity is the most common starting point?
ReLU
T/F: In backprop and auto diff, the learning algorithm needs to be modified depending on what’s inside
False
L1 Regularization
- L1 Norm
- encourages sparsity (lots of zero or near-zero weights, only a few non-zero)
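A sketch of how both penalties can be added to a training loss in PyTorch (the lambda values are illustrative; L2 is often applied via the optimizer's weight_decay instead):

    def regularized_loss(loss, params, l1_lambda=1e-5, l2_lambda=1e-4):
        # L1 pushes many weights toward exactly zero (sparsity); L2 keeps weights small
        l1 = sum(p.abs().sum() for p in params)
        l2 = sum((p ** 2).sum() for p in params)
        return loss + l1_lambda * l1 + l2_lambda * l2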
Convolution has the property of _____
- equivariance
- if feature translated a little bit, output values move by the same translation
- regardless of whether a pooling layer is involved
Method to get around loss geometries (ie. plateaus or saddle points)
- Momentum
- Decay the velocity over time for past gradients and add the current gradient
- Weight update uses the velocity (not the gradient)
- Used to “pop out” of local plateaus or saddle points
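A toy sketch of the update rule (minimizing f(w) = 0.5 * w^2; the beta and lr values are illustrative):

    # SGD with momentum: velocity is a decayed sum of past gradients
    w, velocity = 5.0, 0.0
    beta, lr = 0.9, 0.1
    for _ in range(100):
        grad = w                           # gradient of 0.5 * w^2
        velocity = beta * velocity + grad  # decay old velocity, add current gradient
        w = w - lr * velocity              # weight update uses the velocity, not the gradient
    print(w)                               # approaches 0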
Nesterov Momentum
- Rather than combine velocity with current gradient, go along with velocity first and then calculate gradient at new point
- We know velocity is probably a reasonable direction

What is the False Positive rate (FPR)?
fp / (fp + tn)
Deep learning involves complex, compositional, non-linear functions, which cause the loss landscape to be ____
extremely non-convex
Change in loss indicates speed of ____
learning
Adam
- Combining ideas of other algorithms
- Maintain both first and second moment statistics for gradients
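A toy sketch of the Adam update on the same kind of problem, including the time-varying bias correction mentioned on the related card (beta1/beta2 as listed there):

    import math

    # Adam on f(w) = 0.5 * w^2 (sketch)
    w, m, v = 5.0, 0.0, 0.0
    beta1, beta2, lr, eps = 0.9, 0.999, 0.1, 1e-8
    for t in range(1, 101):
        grad = w                                   # gradient of 0.5 * w^2
        m = beta1 * m + (1 - beta1) * grad         # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * grad ** 2    # second moment (mean of squared gradients)
        m_hat = m / (1 - beta1 ** t)               # bias correction: moments start at zero
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    print(w)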

Limitation of Linear Layers
- Images
- 1024 x 1024 = ~ a million elements (M)
- Fully connected layer (N)
- Parameters = M*N (weights) + N
- hundreds of millions of params for one layer
- More parameters => more data needed to fit
Ways to analyze non-linear functions in DL models
- min/max
- correspondence between input & output stats
- gradients
- at initialization
- at extremes
- computational complexities
Normalization methods
- subtract mean, divide by standard deviation
- (most common)
- this can be done per dimension
- whitening (ie. through PCA)
- (not common)
Combining Convolution and Pooling layers - Benefit?
- Invariance
- Pooling layer has invariance to translation of the features
- If feature is translated (moved) a bit, output values still remain the same
- pooling layer (ie max pooling) retains max values in patches as long as movement is not larger than pooling window
The velocity term is an ____ of the gradient
exponential moving average

Complexities of Batch Normalization
- During training, compute the empirical mean and variance of the mini-batch at each iteration (ie. normalizing by a different amount each time)
- causes noise in the mean/variance estimates
- During inference, use stored mean/variance (running averages) calculated on the training set
- Sufficient batch sizes must be used to get stable per-batch estimates during training.
- issue with multi-GPU or multi-machine training
- PyTorch provides synchronized batch statistics (SyncBatchNorm) to address this
Where should Batch Normalization be applied and why?
- where
- before every non-linearity
- why
- low/high values (un-normalized, imbalanced) cause saturation issues
Relationship between forward pass and gradients in CNNs
- Forward pass and gradients are opposite
- If forward = cross correlation, then gradients are convolutions (vice versa)
What is cowmask?
- Combine images with patterned masks that determine which image each pixel is taken from
- result is a complex transformation
- forces NN to be robust to occlusion (ie. object hidden behind another object in an image)
- Ground truth label is proportional to how many pixels came from each image
Learning Rate Schedules
- Hand-coded ways to schedule learning rate
- Theoretical results rely on annealed learning rate
- (learning rate with decay)
- Empirically derived learning rate
- graduate student
- stare at loss curve and determine convergence
- step-scheduler
- ex: divide learning rate by 10 every few epoch
- exponential-scheduler
- cyclical scheduler (cosine scheduler)
- alternate between a base and a max learning rate based on the step size
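A sketch of a step scheduler in PyTorch (the model and the divide-by-10-every-30-epochs schedule are placeholders):

    import torch

    model = torch.nn.Linear(10, 2)  # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(90):
        # ... train for one epoch ...
        scheduler.step()            # lr: 0.1 -> 0.01 -> 0.001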
2D Discrete Convolution
- Image
- input image
- Kernel
- applied to image
- initialize randomly and optimize
- our params (plus bias)
- Output filter / feature map
- output of image and kernel
Deeper networks are ___ sensitive to initialization
(more or less?)
- Deeper networks are more sensitive to initialization
- Activations get smaller as you move through the layers, so the standard deviation shrinks
- result: smaller update
- Larger initial values
- result: saturation
T/F: We can have a flat loss curve but increasing accuracy
True - accuracy depends only on the argmax of P(Y = yi | X = xi), so the correct class score only has to be slightly higher than the others, even if the CE loss (which uses the full probability) barely changes.
Recurrent Neural Network
Better for non-fixed size inputs (ie. sentences, phrases, etc.)
alternate architecture: transformers
Why don’t multi-kernel CNNs have kernels that learn the same filters?
The kernels are initialized to different values and so have different local minima and gradients
T/F: You can have higher training loss than validation loss
- True - Validation has no regularization term so may look better, and it is typically measured at the end of an epoch, while training loss is measured as you go and averaged across the iterations (so it is pulled up by the higher losses early in the epoch)
Why is depth important in a neural network
- modeling compositionality
- parameter efficiency
- dimensionality reduction
How to optimize to find good weights
- different optimizers have different biases
- different weight updates
- weight optimization
- regularization
- loss functions
noisy gradients
- caused by use of subset of data at each iteration to calculate the loss
- unbiased estimator with high variance
- slower convergence
Geometric transformation layers are more important for this type of DL problem
computer vision
How do you balance the standard deviation across the DL layers?
- Xavier Initialization
- Sample from a uniform distribution
- nj - fan in
- number of input nodes
- nj+1 - fan out
- number of output nodes
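A sketch of the uniform form of Xavier initialization, assuming the usual ±sqrt(6 / (n_j + n_j+1)) bounds:

    import numpy as np

    def xavier_uniform(fan_in, fan_out):
        # fan_in (n_j): number of input nodes; fan_out (n_j+1): number of output nodes
        bound = np.sqrt(6.0 / (fan_in + fan_out))
        return np.random.uniform(-bound, bound, size=(fan_in, fan_out))

    W = xavier_uniform(512, 256)  # keeps activation std roughly constant across layers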

Condition Number
- Ratio of the largest to the smallest eigenvalue
- Tells us how different the curvature is along different dimensions
- high value
- SGD will make big steps in some dims, small in others
- second-order optimization methods divide steps by curvature
- expensive to compute
What are ROC Curves?
- TPR/FPR curves represent the inherent tradeoff between number of positive predictions and correctness of predictions
- AUC is area under curve - common single-number metric to summarize
The number of channels in the output map is equal to _____
the number of kernels
What is the problem with adagrad?
- the effective learning rate goes to zero as gradients accumulate (the denominator sums squared gradients over iterations)
Relationship between loss function and other metrics
- Can be complex
- Metrics (not loss) are often not differentiable
- accuracy
- precision/recall
what is model capacity?
Number of parameters
regularization
any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error
What is the problem with Adam?
- unstable in the beginning
- one or both moments will be tiny values
Max Pooling
Stride window across the image but perform a per-patch max operation
(ie. take the max of pixels in the patch)
First step of designing the architecture
- understand data
- ask experts
- data types already have good architectures
- use what others have discovered
- understand the flow of gradients
- learning is not equal across the architecture
- could be bottlenecks
what happens when initializing to a constant value
- weights will be the same
- shared weights
- gradients will be the same
- as a result: all weights will be updated the same
Hessian
- Matrix of second-order gradients
- Gives information about the curvature of the loss surface
- Not often used in Deep Learning (computationally inefficient)
Geometric transformation
- Used for data augmentation
- translation
- rotation
- scale
- shear
Normalization can be done with learnable parameters
“Batch Normalization”
- gamma (scale)
- beta (shift)
- determine what extent to normalize
- or if not at all
What is the True Positive Rate (TPR)?
tp / (tp + fn)
Learning Many Features
- Weights are not shared across different feature extractors
- params (K1 x K2 + 1) * M
- M - number of features to learn
What suggests underfitting when looking at a validation/training curve?
- validation loss very close to training loss, or both are high
- should be able to get very low training loss in NN
leaky ReLU
- min: -infinity, max: infinity
- slightly negative slope with x <= 0
- slope can be parameterized
- no saturation
- gradients
- no dead neurons
- still cheap to compute
- subgradients
- not fully differentiable
T/F: There is one activation function best for all applications
False
How many parameters are learned in the max pooling layer?
None!
When to use cross-validation
- expensive, not done often in NN
- useful if you may not have a lot of data
What is convolution
- Mathematical operation on two functions f and g producing a third function
- third function is modified version of original functions
- gives area of overlap between initial functions
- similar to cross-correlation
Convolution hyperparameters
- in_channels
- # channels in input image
- out_channels
- # channels produced by convolution
- kernel_size
- size of convolving kernel
- stride
- stride of the convolution (default: 1)
- stride when moving across image
- padding
- zero-padding added to both sides of the input
- padding_mode
- zeros, reflect, replicate, circular
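These hyperparameters map directly onto PyTorch's nn.Conv2d (values illustrative):

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3,
                     stride=1, padding=1, padding_mode="zeros")
    x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
    y = conv(x)                     # shape (1, 16, 32, 32): one output channel per kernel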
What are the most crucial hyperparameters to tune?
- learning rate
- weight decay
Convolutional Neural Networks
Feature extractors for small local images
Better for images
Applied to sentences too
How does SGD + momentum work compared to adaptive methods?
- Typically generalizes better, but requires more tuning
Complexity of a model is only limited by _____
computation and memory
T/F: Hyperparameters cannot be independently tuned
- True - hyperparameters are interdependent
- ex - batch norm/dropout maybe not needed together
- learning rate should be changed proportionally to batch size
- gradients are more reliable/smooth
How can you mitigate the adagrad problem with gradients?
- Keep a moving average of squared gradients to avoid saturating the learning rate
- Doesn’t go to zero but decays

How do you tune hyperparameters?
- Start with coarser search
- learning rate - {0.1, 0.05, 0.03, 0.01, 0.003, 0.001, 0.0005, 0.0001}
Adagrad
- Use gradient statistics to reduce learning rate across iterations
- Accumulator takes previous accumulator plus the square of the gradient
- Weight update is learning rate / (square root of accumulator + epsilon)
- denominator sums up gradients over iterations
- directions with high curvature will have higher gradients and reduce learning rate
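A toy sketch of the accumulator and update described above, which also shows the effective learning rate decaying as squared gradients accumulate:

    import math

    # Adagrad on f(w) = 0.5 * w^2 (sketch)
    w, accumulator = 5.0, 0.0
    lr, eps = 1.0, 1e-8
    for _ in range(100):
        grad = w
        accumulator += grad ** 2                              # sum of squared gradients
        w = w - (lr / (math.sqrt(accumulator) + eps)) * grad  # effective lr shrinks over time
    print(w)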

T/F: combinations of only linear layers have the same representational powers as single layer non-linear models
False - Combination of only linear layers has the same representational power as one linear layer
Striding across an image using larger steps results in _______
loss of information and dimensionality reduction (not recommended as a dimensionality-reduction method)
Convolutional & Pooling layers
- Alternating Convolution and Pooling layers
- Convolution + Non-linear layer
- feature extraction
- Pooling
- reduce dimension of data (e.g. a 3x3 image patch reduced to 1 value)
- End with a fully connected layer to classify
Why are images less conducive to fully connected linear layers?
- images features are spatially localized
- edge/contour detectors can extract data from image patches
- gradients (dark to light on images)
- small features are repeated
- color
- motifs (corners, etc.)
- edges
- edge/contour detectors can extract data from image patches
- no reason to believe one feature tends to appear in one location of image
- ie. bird beaks are not in the center of every image
T/F: In convolutional NNs we do not need to learn location-specific features
- True
- Nodes in different locations can share features
- ie. image edges can be similar on different parts of a bird image (ie. 2 wings)
T/F: For a learned kernel, Convolution and Cross-Correlation are treated differently
- False
- Kernels are randomly initialized and learned
- Doesn’t matter if we flip the kernel or not
- no difference between convolution and cross-correlation
What does loss (and then weights) turning into NaNs suggest?
- too high of a learning rate
- divide by zero
- forget the log of the loss (causing divergence)
How do you adapt the kernel for multi-channel input images?
- Use 3-channel kernels
- Use dot product with 3x3x3 kernel
- element-wise multiplication between kernel and image patch, summing them up
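A small numpy sketch of that dot product for a single output location (3-channel 3x3 kernel applied to a 3-channel 3x3 patch):

    import numpy as np

    patch = np.random.rand(3, 3, 3)   # image patch: (channels, height, width)
    kernel = np.random.rand(3, 3, 3)  # one 3-channel kernel
    bias = 0.1                        # illustrative bias value

    out_value = np.sum(patch * kernel) + bias  # element-wise multiply, then sum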
Backprop and auto diff allows us to optimize ___ composed of ____
Backprop and auto diff allows us to optimize any function composed of differentiable blocks