Quiz 2 Flashcards
Receptive fields
Each node only receives input from a K1 × K2 window (image patch).
The region from which a node receives its input is called its receptive field.
Shared Weights
Nodes in different locations can share features.
Uses the same weights/parameters in the computation graph.
- Reduce parameters to (K1 × K2 + 1)
- Explicitly maintain spatial information
Learning Many Features
Weights are not shared across different feature extractors.
Reduce parameters to (K1 × K2 + 1) · M, where M is the number of features to be learned
Convolution
In mathematics, a convolution is an operation on two functions f and g that produces a third function, typically viewed as a modified version of one of the originals; it gives the area of overlap between the two functions as a function of the amount by which one of them is translated.
T or F: Convolutions are linear operations
True
What are CNN hyperparameters
- in_channels(int): Number of channels in the input image
- out_channels(int): Number of channels produced by the convolution
- kernel_size (int; tuple): Size of the convolving kernel
- stride (int;tuple;optional): denotes the size of the stride used by the convolution (default is 1)
- padding (int;tuple;optional): Zero padding added to both sides of the input (default is 0)
- padding_mode (string): 'zeros', 'reflect', 'replicate', or 'circular' (default: 'zeros')
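A minimal PyTorch sketch of these hyperparameters in use (the layer sizes here are arbitrary examples, not from the course):

```python
import torch
import torch.nn as nn

# Example only: 3 input channels -> 16 feature maps, 3x3 kernel,
# stride 1, one pixel of zero padding.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3,
                 stride=1, padding=1, padding_mode="zeros")

x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
y = conv(x)
print(y.shape)                  # torch.Size([1, 16, 32, 32]) -- spatial size preserved
```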
Output size formula for a vanilla convolution operation
Per spatial dimension:
output_dim = (N - F + 2P) / S + 1
N: Input dimension
F: Filter dimension
P: Padding
S: Stride
+1: accounts for the filter's starting position (it is not a bias term)
The full output volume is output_dim_1 × output_dim_2 × N_filters, with one output channel per filter.
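A small helper that applies the formula along one spatial dimension (the function name and the sizes are made up for illustration):

```python
def conv_output_dim(n, f, p=0, s=1):
    """Output size along one spatial dimension: (N - F + 2P) / S + 1."""
    return (n - f + 2 * p) // s + 1

print(conv_output_dim(32, 5))        # 28: a 5x5 filter shrinks a 32x32 input to 28x28
print(conv_output_dim(32, 5, p=2))   # 32: padding of 2 gives a "same"-sized output
```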
'Valid' convolution
The kernel is applied only where it fits entirely inside the image (no padding).
T or F: Larger the filter the smaller the shrinkage
False. Larger filter = larger shrinkage.
'Same' convolution
zero-padding the image borders to produce an output the same size as the raw input
CNN: Max pooling
For each window, calculate its max.
Pros: No parameters to learn.
CNN: Stride
The number of pixels the filter moves between successive applications.
CNN: Pooling layer
Make the representations smaller and more manageable through downsampling.
Only pools width and height, not depth
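A short PyTorch sketch (example sizes assumed) showing that pooling shrinks width/height but leaves the channel (depth) dimension alone, and has no learnable parameters:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # no parameters to learn

x = torch.randn(1, 16, 32, 32)   # (batch, channels, height, width)
y = pool(x)
print(y.shape)                   # torch.Size([1, 16, 16, 16]) -- depth (16 channels) unchanged
```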
CNN: Cross-correlation
Takes the dot product of a small filter (also called a kernel or weights) and an overlapping region of the input image or feature map.
Does not flip the kernel (unlike a true convolution).
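A toy NumPy version of 2-D cross-correlation (assumes a square image and kernel; names are made up), showing the dot product at each position with no kernel flip:

```python
import numpy as np

def cross_correlate_2d(image, kernel):
    """Slide the kernel over the image WITHOUT flipping it, taking a dot
    product at every position where it fits ('valid' size: m - k + 1)."""
    m, k = image.shape[0], kernel.shape[0]
    out = np.zeros((m - k + 1, m - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

print(cross_correlate_2d(np.ones((5, 5)), np.ones((3, 3))))   # 3x3 array of 9s
```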
CNN: T or F - Using a stride greater than 1 results in loss of information.
True. Stride > 1 implies jumping over some pixels.
CNN: Output size of vanilla/valid convolution vs full convolution
Vanilla: m-k+1
Full: m+k-1
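A quick 1-D NumPy check of the two output sizes (example lengths assumed):

```python
import numpy as np

m, k = 10, 3
signal, kernel = np.random.rand(m), np.random.rand(k)

print(np.convolve(signal, kernel, mode="valid").shape)   # (8,)  -> m - k + 1
print(np.convolve(signal, kernel, mode="full").shape)    # (12,) -> m + k - 1
```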
CNN: Benefit of pooling
Makes the representation invariant to small changes in the input
CNN: Full convolution
Enough zeros are added to the borders so that every pixel is visited k times in each direction. This results in an image of size m+k-1.
Full = Bigger size than original
Sigmoid
Min=0; max=1
Output is always positive
Saturates at both ends
Gradients vanish at both ends (as the output converges to 0 or 1, the gradient approaches zero)
Gradients are always positive
Computational complexity is high due to the exponential term
tanh
Min = -1; Max = 1; zero-centered
Saturates at both ends (-1, 1)
Gradients: vanish at both ends; always positive
Medium complexity, as tanh is not as simple as, say, multiplication
ReLU
Min = 0; Max = ∞; always positive
Not saturated on the positive side
Gradients: 0 when x <= 0 (aka dead ReLU); constant otherwise (doesn't vanish, which is good)
Cheap: computation doesn't come much easier than the max function
T or F: ReLU is differentiable
Technically no: it is not differentiable at zero (but it is differentiable everywhere else).
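A small NumPy sketch of the three activations' gradients, illustrating saturation at large |x| for sigmoid/tanh versus the constant positive-side gradient of ReLU:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

sig_grad = sigmoid(x) * (1 - sigmoid(x))   # ~0 at both ends (saturation), always positive
tanh_grad = 1 - np.tanh(x) ** 2            # ~0 at both ends (saturation), always positive
relu_grad = (x > 0).astype(float)          # 0 for x <= 0 ("dead"), constant 1 otherwise

print(sig_grad, tanh_grad, relu_grad, sep="\n")
```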
Initialization: What happens if you initialize close to a bad local minima
Poor gradient flow
Initialization: What happens if you initialize with large activations
Reach saturation quickly
Initialization: What happens if you initialize with small activations
Activations stay in the linear regime of the nonlinearity (or close to it), so you will have a strong gradient to learn from.
Initialization: What happens if you initialize all weights with a constant
All nodes learn the same thing (the symmetry is never broken).
Initialization: Common practice
1) Random sample from a small normal distribution, N(μ = 0, σ = 0.01)
2) Random sample from uniform distribution
Initialization: Why are equal (in terms of sampling from distribution), small weights preferred
There is no a priori reason why some weights should be larger than others.
Initialization: T or F - Deeper networks are less sensitive to initialization
False. Deeper networks are more sensitive to initialization because activations get progressively smaller with depth.
Initialization: Fan-in Fan-out rule
Maintain the variance at the output to be similar to that of the input. Keeps each layerβs variance the same.
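A NumPy sketch of one common form of this rule (Xavier/Glorot-style scaling; exact variants differ, and the sizes here are arbitrary):

```python
import numpy as np

fan_in, fan_out = 256, 128

# Scale the weights so the output variance stays on the same order as the input variance.
W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / (fan_in + fan_out))

x = np.random.randn(1000, fan_in)   # roughly unit-variance inputs
print(x.var(), (x @ W).var())       # the two variances stay on a similar scale
```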
Optimization: Issues that hinder optimization
Noisy gradient estimates (due to taking MiniBatches)
Saddle points
Ill-conditioned loss surface, where the curvature is high in one direction but not in others
Optimization: Loss surfaces that can cause problems
Local minima
plateaus
saddle points, a point that is a min in one axis but a max in another
Optimization: Momentum
Overcomes plateaus by adding an exponential moving average of past gradients to the update. Helps move off of areas with low gradients.
Optimization: Nesterov Momentum
Calculates gradient AFTER applying momentum term
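A sketch of both updates on a toy quadratic loss (generic function names; the only difference is where the gradient is evaluated):

```python
import numpy as np

def momentum_step(w, v, grad_fn, lr=0.01, beta=0.9):
    """Classical momentum: v is an exponentially decaying sum of past gradients."""
    v = beta * v - lr * grad_fn(w)
    return w + v, v

def nesterov_step(w, v, grad_fn, lr=0.01, beta=0.9):
    """Nesterov momentum: the gradient is computed AFTER applying the momentum term."""
    v = beta * v - lr * grad_fn(w + beta * v)
    return w + v, v

grad_fn = lambda w: w            # gradient of the toy loss L(w) = 0.5 * ||w||^2
w, v = np.ones(3), np.zeros(3)
for _ in range(200):
    w, v = nesterov_step(w, v, grad_fn)
print(w)                         # close to the minimum at 0
```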
Optimization: How to use Hessian
Use 2nd order derivatives to get information about the loss surface
Optimization: Condition number
The ratio of the largest to the smallest eigenvalue of the Hessian
Tells us how different the curvature is along different dimensions
Optimization: General idea of techniques like Adam
Apply per-parameter learning rates
Optimization: Adagrad
Adapts the learning rate for each parameter based on the historical gradients
Parameters with larger gradients get a rapid decrease in learning rate, while those with small gradients get a slower decrease.
Pro: Works well in a gently sloped parameter space.
Con: Can prematurely make learning rate too small.
Optimization: RMSProp
Like AdaGrad, but replaces the accumulated sum of squared gradients with an exponential moving average.
Optimization: Adam
Like Adagrad / RMSProp but includes momentum terms
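A condensed sketch of the three per-parameter update rules (simplified; generic names, Adam's bias correction included):

```python
import numpy as np

lr, eps = 0.01, 1e-8

def adagrad_step(w, g, G):
    G = G + g ** 2                          # running SUM of squared gradients (only grows)
    return w - lr * g / (np.sqrt(G) + eps), G

def rmsprop_step(w, g, G, rho=0.9):
    G = rho * G + (1 - rho) * g ** 2        # exponential moving average instead of a sum
    return w - lr * g / (np.sqrt(G) + eps), G

def adam_step(w, g, m, v, t, beta1=0.9, beta2=0.999):
    m = beta1 * m + (1 - beta1) * g         # momentum term (first moment)
    v = beta2 * v + (1 - beta2) * g ** 2    # RMSProp-style term (second moment)
    m_hat = m / (1 - beta1 ** t)            # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```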
Regularization: L1 norm
Adds a sign function of the weights to the update, in addition to the loss gradient. Results in sparse parameters.
w^j = w^{j-1} - α ( λ sign(w^{j-1}) + ∇L(w^{j-1}) )
Regularization: L2 norm
Applies a regularization term to the weights at each update
w^j = w^{j-1} - α ( λ w^{j-1} + ∇L(w^{j-1}) )
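A side-by-side sketch of the two updates (generic names; lam is the regularization strength):

```python
import numpy as np

def l2_update(w, grad_loss, lr=0.01, lam=1e-4):
    # Weight decay: pull each weight toward zero in proportion to its value.
    return w - lr * (lam * w + grad_loss)

def l1_update(w, grad_loss, lr=0.01, lam=1e-4):
    # Constant-magnitude pull toward zero via sign(w); tends to drive weights to exactly 0 (sparsity).
    return w - lr * (lam * np.sign(w) + grad_loss)
```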
Regularization: Dropout
Dropout is a technique in which a set of activations/parameters is randomly masked (i.e., multiplied by a matrix of 1s and 0s) so that a subset of parameters does not participate in learning on that pass.
Makes the model less reliant on any individual, highly effective parameters
Regularization: What needs to be done with dropout during inference
Either:
1) Scale the outputs or weights by the keep probability p at test time: W_test = W * p, or
2) Scale by 1/p during training (inverted dropout), so nothing needs to change at inference.
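A NumPy sketch of option 2 (inverted dropout), where the 1/p scaling is done at training time so inference is a no-op:

```python
import numpy as np

p = 0.5   # probability of KEEPING a unit

def dropout_forward(x, train=True):
    if train:
        mask = (np.random.rand(*x.shape) < p) / p   # drop units, then rescale by 1/p
        return x * mask
    return x                                        # inference: nothing to do

x = np.random.randn(4, 8)
print(dropout_forward(x, train=True).mean(), dropout_forward(x, train=False).mean())
```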
Batch norm: Diff between batch vs layer norm
Batch: Normalizes activations along the batch dimension
Layer: Normalizes activations along the feature (channel) dimension for each data point in the mini-batch
Batch Norm: Definition
Normalizes the activations of a layer across a mini-batch of data, which helps stabilize and speed up training by reducing internal covariate shift.
Batch Norm: How to use batch norm during inference
Use the running mean/variance averages accumulated during training.
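A minimal NumPy sketch of both modes (simplified; generic names, learnable gamma/beta included):

```python
import numpy as np

eps, momentum = 1e-5, 0.1

def batchnorm_forward(x, gamma, beta, running_mean, running_var, train=True):
    if train:
        mu, var = x.mean(axis=0), x.var(axis=0)                       # per-feature batch statistics
        running_mean = (1 - momentum) * running_mean + momentum * mu  # saved for inference
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        mu, var = running_mean, running_var                           # training-time running averages
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var

x = np.random.randn(32, 4)
y, rm, rv = batchnorm_forward(x, np.ones(4), np.zeros(4), np.zeros(4), np.ones(4), train=True)
print(y.mean(axis=0), y.var(axis=0))   # ~0 and ~1 per feature
```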
Batch norm: Pros
Improves gradient flow
Allows higher learning rates
Reduces dependence on initialization
Differentiable
Batch norm: Cons
Sufficient batch sizes must be used to get stable per-batch mean/variance.
Batch norm: Where to apply in network
Right before activation
Batch norm: T or F - Batch norm is useful for linear networks
False. This is because we have normalized out the first- and second-order statistics, which is all that a linear network can influence
Batch norm: T or F - Batch norm is useful for deep networks
True. In a deep neural network with nonlinear activation functions, the lower layers can perform nonlinear transformations of the data, so they remain useful.
Batch norm: T or F - The bias term should be omitted during batch norm
True. The bias term should be omitted because it becomes redundant with the Ξ² parameter applied by the batch normalization reparametrization.
Batch norm: T or F - For CNNs, you should use different mean/variance parameter values for each layer
False. It is important to apply the same normalizing ΞΌ and Ο at every spatial location within a feature map, so that the statistics of the feature map remain the same regardless of spatial location.
Initialization: Variance of ReLU
N(0, 1) × sqrt(2 / n_j), where n_j is the fan-in of layer j (He initialization for ReLU)
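In NumPy (sizes assumed), this scaling is one line:

```python
import numpy as np

fan_in, fan_out = 256, 128
W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)   # He initialization for ReLU layers
```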
Batch: Batch gradient descent
Network goes through the forward/backward pass using the entire dataset.
Pro: Provides deterministic updates.
Con: Computationally expensive.
Batch: Stochastic gradient descent
Network goes through f/b using only one example.
More noisy updates, but can have regularization effect by pushing model out of local minima.
Doesn't take advantage of parallelization
Batch: Mini-batch gradient descent
Takes N samples from the empirical distribution and runs f/b pass.
Benefits from parallel processing.
More stable than SGD.
Supervised pretraining
Model is initially trained on a related task that has labeled data (supervised learning) before fine-tuning it on the target task of interest.
Knowledge gained during the initial training phase can serve as a useful starting point for the model when tackling the target task.
Formula to calculate number of parameters and bias
K (F1 * F2 * D + 1)
K: Number of kernels
F: Filter dimensions
D: Number of input channels
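A quick PyTorch check of the formula (example sizes assumed):

```python
import torch.nn as nn

K, F1, F2, D = 16, 3, 3, 8   # 16 kernels of size 3x3 over 8 input channels
conv = nn.Conv2d(in_channels=D, out_channels=K, kernel_size=(F1, F2))

n_params = sum(p.numel() for p in conv.parameters())
print(n_params, K * (F1 * F2 * D + 1))   # both: 1168 (K*F1*F2*D weights plus one bias per kernel)
```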