Quiz #2 Flashcards

Question

If we have a receptive field that DOES NOT share weights and is 3x3 pixels connected to 5 output nodes, how many parameters will there be to learn?

Answer 1

((K1 x K2) + 1) * N --> ((3*3) + 1)*5 = 50

Answer 2

False. There's no reason to assume that a pattern that occurs in the center of an image won't also appear, now or in a later image, at some other arbitrary location.

Answer 3

Output nodes at different locations share weights across the input space. For example, W11 in the leftmost node would be the same as W11 in the rightmost node. We use shared weights so that we can learn spatial features that are invariant to simple affine transformations, e.g. translation.

Answer 4

(K1*K2) + 1 = 3*3 + 1 = 10

Answer 5

False. Weights are shared for the SAME feature extractor across the spatial input, but they are NOT shared between DIFFERENT feature extractors, i.e. each feature extractor has its own independent set of weights.

Answer 6

(K1*K2 + 1) * M, where M is the number of features we want to learn --> (3*3 + 1)*4 = 40
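As a quick check of the three counts in Answers 1, 4, and 6, here is a minimal Python sketch. The function names are made up for illustration and are not from the course material.

```python
def unshared_params(k1, k2, n_outputs):
    # Each output node has its own K1 x K2 weights plus one bias.
    return (k1 * k2 + 1) * n_outputs


def shared_params(k1, k2, n_features=1):
    # One K1 x K2 kernel plus one bias per feature extractor,
    # reused at every spatial location.
    return (k1 * k2 + 1) * n_features


print(unshared_params(3, 3, 5))  # 50
print(shared_params(3, 3))       # 10
print(shared_params(3, 3, 4))    # 40
```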

Answer 7

False. Flipping the kernel is useful mathematically because it makes the math work out more elegantly. But since we're actually learning the kernel values in a CNN and the weights are initialized randomly, the flipping operation is superfluous in practice.

Answer 8

Simply take the dot product of the input with the kernel (i.e. element-wise multiply and sum).

Answer 9

Convolution. (this is the duality principle that arises in the forward/backward pass of a convolutional layer)

Answer 10

False. It is a simple linear operation.

Answer 11

Cross-Correlation. (this is the duality principle that arises in the forward/backward pass of a convolutional layer)

Answer 12

It only applies the kernel when it is fully within the image. The output size is: (H - K1 + 1) x (W - K2 + 1). The output dimensions will be SMALLER than the input.
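A minimal pure-Python sketch of this "valid" mode may help. It uses cross-correlation (no kernel flip), and the 5x5 ramp image and 3x3 box kernel are made-up inputs chosen only to show the output shape.

```python
def cross_correlate_valid(image, kernel):
    """Slide the kernel over the image without flipping or padding."""
    h, w = len(image), len(image[0])
    k1, k2 = len(kernel), len(kernel[0])
    out = []
    for i in range(h - k1 + 1):
        row = []
        for j in range(w - k2 + 1):
            # Element-wise multiply the window with the kernel, then sum.
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(k1)
                for b in range(k2)
            )
            row.append(total)
        out.append(row)
    return out


image = [[r * 5 + c for c in range(5)] for r in range(5)]  # 5x5 ramp
kernel = [[1] * 3 for _ in range(3)]                        # 3x3 box filter
result = cross_correlate_valid(image, kernel)
print(len(result), len(result[0]))  # 3 3
```

The kernel is only ever placed where it fits entirely inside the image, which is why each output dimension shrinks by the kernel size minus one.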

Answer 13

This form of convolution can be used to force the output size to be the same as the input by adding padding to the input image (zeros, mirrored, etc.). In general the output size is: (H - K1 + P1 + 1) x (W - K2 + P2 + 1)

Answer 14

(H - K1 + 1) x (W - K2 + 1) = (5 - 3 + 1) x (5 - 3 + 1) = 3x3

Answer 15

(H - K1 + P1 + 1) x (W - K2 + P2 + 1) = (5 - 3 + 1 + 1) x (5 - 3 + 1 + 1) = 4x4
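The size formulas in Answers 12 to 15 can be sketched as one small helper. This assumes stride 1 and reads P as the TOTAL padding added along a dimension, which is how the arithmetic on these cards works out; many libraries instead take padding per side, so treat the function name and argument convention as illustrative only.

```python
def conv_output_size(size, kernel, total_padding=0):
    # Cards' formula for stride 1: size - kernel + total_padding + 1.
    return size - kernel + total_padding + 1


print(conv_output_size(5, 3))      # 3  (valid, no padding)
print(conv_output_size(5, 3, 1))   # 4  (one pixel of padding in total)
print(conv_output_size(5, 3, 2))   # 5  (total padding K - 1 keeps the size)
```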

Answer 16

Typically false. Using a stride greater than one results in losing information because we're skipping pixels.

Answer 17

C x K1 x K2, where C is the dimensionality of the input. For example, an RGB image would be 3 x K1 x K2.

Answer 18

Because we initialize the weights randomly, so as gradient descent is applied the weights in each map will tend to converge to different values as a result of their different starting state. However, it is still possible to learn redundant feature maps, so random initialization is more of a heuristic than a guarantee.

Answer 19

Since we concatenate the feature maps in the output, we'll also have M output channels.

Answer 20

N * (K1 * K2 * 3 + 1) = 4 * (3 * 3 * 3 + 1) = 112. The number of channels in the output is simply equal to the number of filters, so 4.
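The same count as a short Python sketch, with the number of input channels made an explicit argument. The function name is hypothetical.

```python
def conv_layer_params(n_filters, k1, k2, in_channels):
    # Each filter spans every input channel and carries one bias.
    return n_filters * (k1 * k2 * in_channels + 1)


print(conv_layer_params(4, 3, 3, 3))  # 112
```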

Answer 21

Dimensionality reduction.

Answer 22

It is useful for dimensionality reduction of the input. It is performed by striding a window across the image, but instead of applying convolution, the max-operation is applied to every window. This gives us a scalar output from a matrix input. For example, a 3x3 window would have 9 elements, but by applying the max operation, this becomes a single value.
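A rough pure-Python sketch of max pooling follows. The 4x4 input and the choice of a 2x2 window with stride 2 are assumptions made for the example.

```python
def max_pool2d(image, window, stride):
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h - window + 1, stride):
        row = []
        for j in range(0, w - window + 1, stride):
            # Reduce each window to a single scalar: its maximum.
            row.append(
                max(
                    image[i + a][j + b]
                    for a in range(window)
                    for b in range(window)
                )
            )
        out.append(row)
    return out


image = [
    [1, 3, 2, 0],
    [4, 6, 5, 1],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
pooled = max_pool2d(image, window=2, stride=2)
print(pooled)  # [[6, 5], [7, 9]]
```

Note that the function has no weights at all, only the window size and stride, which are hyperparameters rather than learned values.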

Answer 23

None. The max operation takes no arguments other than the input, so nothing needs to be learned.

Answer 24

False. Any differentiable function can be used (e.g. average, etc.). In practice though, it's uncommon to use something other than max pooling.

Answer 25

This combination allows learned features to exhibit some degree of INVARIANCE to simple affine transformations like translation. If the translation of some feature in the image is within the bounds of the pooling layer, it should still be recognized by the feature map.

Answer 26

False. Convolution has the property of 'Equivariance'. A translation of the feature results in the output being shifted by the same amount.

Answer 27

1. Invariance (features with small transformations/deformations should still activate the output) 2. Equivariance (no matter where the feature occurs in the image, the feature map will be activated, with the output values moving by the same translation)
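Both properties can be seen in a tiny 1D example. This is only a sketch: the signal and the edge-detector kernel are made up, and taking the max over the whole output stands in for a pooling window large enough to cover the shift.

```python
def correlate1d(signal, kernel):
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [1, 0, -1]
x = [0, 0, 1, 2, 1, 0, 0, 0]
x_shifted = [0] + x[:-1]  # same pattern moved one step to the right

y = correlate1d(x, kernel)
y_shifted = correlate1d(x_shifted, kernel)

print(y)          # [-1, -2, 0, 2, 1, 0]
print(y_shifted)  # [0, -1, -2, 0, 2, 1]

# Equivariance: the feature map moved by the same one step.
print(y_shifted[1:] == y[:-1])   # True
# Invariance after pooling: the max response is unchanged.
print(max(y) == max(y_shifted))  # True
```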

Answer 28

1. Architecture 2. Data considerations 3. Training and Optimization 4. Machine Learning considerations

Answer 29

1. What modules (layers) should we use? 2. How should they be connected together? 3. Can we use domain knowledge to add architectural biases?

Answer 30

False. This is one of the downsides of a FC NN. Since every input has its own weights, it can only accept inputs of a fixed shape.

Answer 31

1. What optimizer should we use? Different optimizers make different weight updates depending on the gradients. 2. How should we initialize the weights? If we initialize far away from the minima, can our optimizer actually get us there? 3. What regularizers should we use? DNNs often have more parameters than data. Regularization is often a must to avoid overfitting. 4. What loss function is appropriate? Many different options are available.

Answer 32

1. Balancing the trade off between model CAPACITY and AMOUNT OF DATA 2. Adding appropriate biases based on domain knowledge

Answer 33

The flow of gradients. This is crucial. If the gradient doesn't flow, the model can't learn. Gradients become smaller and smaller as they flow back towards the input.

Answer 34

False. The update is a function of the gradient. As we flow further back in the network, the gradients become smaller. Furthermore, when we actually make the update, we're multiplying the gradient by a learning rate so that the move we make becomes even smaller. This means different parts of the network may learn at very different rates.

Answer 35

True. This is called a bottleneck.

Answer 36

Yes, because we could simply multiply the layers together.

Answer 37

False. If we only use linear layers, the output response is simply a linear response of all the inputs and the weights (i.e. we could just multiply them together). It gives us no new representational power (we fix that by using non-linear activations in a NN).
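To make the "multiply them together" point concrete, here is a small sketch with two made-up weight matrices (biases are left out for brevity). Applying the two layers in sequence gives the same result as one layer whose weight matrix is their product.

```python
def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matvec(a, x):
    return [sum(a[i][k] * x[k] for k in range(len(x))) for i in range(len(a))]


W1 = [[1, 2], [3, 4], [5, 6]]  # 2 inputs -> 3 hidden units
W2 = [[1, 0, -1], [2, 1, 0]]   # 3 hidden units -> 2 outputs
x = [1, -1]

two_layers = matvec(W2, matvec(W1, x))
one_layer = matvec(matmul(W2, W1), x)
print(two_layers, one_layer)  # [0, -3] [0, -3]
```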

Answer 38

The composition of multiple non-linear layers.

Answer 39

False. The gradient flow across linear layers is straightforward. Gradient flow across non-linear layers is not, and is strongly impacted by its shape.

Answer 40

1. Min/Max 2. Correspondence between input and output statistics 3. Gradients: 3.1 At initialization (small values, etc.) 3.2 At extremes 4. Computational complexity

Answer 41

0 and 1, respectively.

Answer 42

False, it does saturate. It is always 1 as x --> +infinity, and always 0 as x --> -infinity

Answer 43

True. The min/max value of sigmoid is 0/1, which means the function is saturated at both ends which results in vanishing gradients.
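A short sketch of the saturation, using the standard identity that the derivative of sigmoid is s * (1 - s). The sample inputs are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, 0.0, 10.0):
    # The output stays inside (0, 1); the gradient peaks at 0.25 for x = 0
    # and is close to zero at both extremes.
    print(x, sigmoid(x), sigmoid_grad(x))
```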

Answer 44

Because the flow of gradients is how we make updates to a DNN. Since we backpropagate gradients, if an upstream gradient is small to begin with, it will become "vanishingly" small as it flows back.
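As a rough illustration, assume a chain of sigmoid layers and ignore the weights entirely. The local sigmoid derivative is at most 0.25, so even in the best case the gradient shrinks geometrically with depth. The function below is a hypothetical helper, not a real backward pass.

```python
def backprop_through_sigmoids(upstream_grad, n_layers, local_grad=0.25):
    # 0.25 is the largest possible sigmoid derivative (at x = 0),
    # so this is a best-case bound; weights are ignored here.
    grad = upstream_grad
    for _ in range(n_layers):
        grad *= local_grad
    return grad


print(backprop_through_sigmoids(1.0, 10))  # 0.25**10, roughly 9.5e-07
```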

Answer 45

False. It has the exponential term in it.

Answer 46

Min: -1, Max: 1

Answer 47

True. This is because its range is -1 to 1.

Answer 48

True. Its min/max values are -1 and 1.

Answer 49

False. Just like sigmoid, it vanishes at both ends and is always positive.

Answer 50

Activation functions whose gradient is always positive.

Answer 51

False. Its min output is 0, max output +infinity

Answer 52

False. Its range is from 0 to +infinity.

Answer 53

Because the function does not saturate on the positive end (the output goes to +infinity and the gradient stays at 1).

Answer 54

False. ReLU takes on a value of zero for any negative inputs, which means a ReLU neuron can effectively "die", if its input is always negative.

Answer 55

False. It is simply the max function, which is easy to compute.

Answer 56

Leaky ReLU is one popular alternative. It has a small, positive slope for all input values < 0, so its range is from -infinity to +infinity.

Answer 57

False. The Leaky ReLU function does not saturate for any input value, whereas normal ReLU does saturate for inputs <=0
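The difference shows up directly in the gradients. In this sketch the slope of 0.01 is an assumed, commonly used value rather than something the cards specify, and the gradient at exactly x = 0 is set by convention.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Zero gradient for x <= 0: a unit stuck here stops learning ("dies").
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def leaky_relu_grad(x, slope=0.01):
    # A small non-zero gradient survives for negative inputs.
    return 1.0 if x > 0 else slope


for x in (-5.0, 2.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```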

Answer 58

Strictly false, but in practice it can be treated as piecewise linear, and the derivative for x <=0 and x > 0 calculated separately.

Answer 59

False. ReLU is the most common choice, but the best choice is a function of your problem at hand. No "one-size-fits-all" solution.

Answer 60

One case would be if you needed to squash the outputs into the 0 to 1 range. In general sigmoid is not the best choice though.

Answer 61

False. It is extremely important. Imagine a case where you initialize the weights to all be within the saturation range of your activation functions. Your model would never learn anything.

Answer 62

False. Naturally, it has a big impact on the BEGINNING of training.

Answer 63

True. You could initialize parameters so that the inputs to the activation functions fall within the linear or nonlinear range to obtain different behavior and limit capacity.

Answer 64

This is a degenerate solution. If all the weights are the same, so will the gradients, so the model can never learn!

Answer 65

It prevents the model from starting in a biased state. Unless we have some good reason to think one feature over another is more important, we should treat the probability of any particular hypothesis in the weight space being true as uniform.

Answer 66

True. This is because in a deep net the activations get smaller as you go deeper in. This leads to small updates.

Answer 67

False. We want them to be the same.

Answer 68

Xavier/Xavier2 initialization

Answer 69

1. Determines the activation statistics, and consequently, the gradient statistics 2. Can impact vanishing/exploding gradient problems

Answer 70

Zero. A batch normalization layer simply calculates the mean and variance of a mini-batch and then normalizes the output. The goal of this is to improve gradient flow. If we want the model to learn to decide whether to normalize or not, we can add learnable parameters for the scale and shift.
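A minimal sketch for a single feature across one mini-batch is shown below. The names gamma (scale) and beta (shift) and the small eps term for numerical stability follow common usage but are assumptions here. With the defaults, the layer has nothing to learn.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # Normalize to zero mean and unit variance, then optionally scale and shift.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
print(normalized)
```

If gamma and beta are made learnable, the model can recover the original activations by learning gamma close to the batch standard deviation and beta close to the batch mean.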

Answer 71

To learn for itself whether to normalize or not

Answer 72

Since these layers need to compute mean/stdev using each mini-batch, we need to ensure that the batch size is large enough that we get good estimates of these parameters for each batch.

Answer 73

False. We don't want very low/high values going into an activation layer as this will cause saturation and prevent gradients from flowing.

Answer 74

False. This used to be the case, but more analysis has tended to suggest that the problem lies elsewhere, such as: * Noisy gradient estimates (i.e. computing over mini-batches) * Saddle points * Ill-conditioned loss surfaces

Answer 75

False. It IS an unbiased estimator, but the batch sizes we're using might be an incredibly small percentage of the total training data, so our estimates will be very noisy. This results in noisy, jerky gradient descent steps.

Answer 76

It's analogous to Newton's first law: "An object in motion tends to remain in motion...". It helps overcome flat regions or saddle points in the loss surface. If we've just gone down a steep "hill", then we'll keep going in that direction. Rather than update the weights using the gradient itself, we use the "velocity" of the gradient to make the update.

Answer 77

True. We want to use what's been happening in the local region to inform the direction of our momentum.
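One common way to write the momentum update is sketched below. The exact form (and the lr and beta values) varies between references, so treat this as one assumed variant. The made-up gradient sequence drops to zero to mimic a flat region.

```python
def sgd_momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + grad  # velocity: decaying sum of past gradients
    w = w - lr * v       # update with the velocity, not the raw gradient
    return w, v


w, v = 1.0, 0.0
for grad in (1.0, 1.0, 0.0, 0.0):  # gradient vanishes on a flat region
    w, v = sgd_momentum_step(w, v, grad)
    print(w, v)
```

Even after the gradient becomes zero, the velocity is still positive, so the weight keeps moving in the direction it was already heading.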

Answer 78

It tells us how different the curvature is along different dimensions. Think about the surface of an umbrella vs. the bottom of a canoe. If the condition number is HIGH, as in the case of the canoe, a small step in the direction of the beam of the boat results in a big change in the other direction. Mathematically, it is the ratio of the largest to the smallest eigenvalue of the Hessian.

Answer 79

False. If the 'condition number' is high, then SGD will make big steps in some dimensions and small steps in the other.
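A toy quadratic makes this visible. The loss f(x, y) = 0.5 * (a * x**2 + b * y**2) is an assumed example whose Hessian has eigenvalues a and b, so its condition number is a / b. The learning rate has to stay small enough for the steep direction, which leaves the shallow direction crawling.

```python
def gradient_descent(a, b, lr, steps, start=(1.0, 1.0)):
    # Minimizes f(x, y) = 0.5 * (a * x**2 + b * y**2).
    x, y = start
    for _ in range(steps):
        x -= lr * a * x  # df/dx = a * x (steep direction)
        y -= lr * b * y  # df/dy = b * y (shallow direction)
    return x, y


a, b = 100.0, 1.0
print("condition number:", a / b)
x, y = gradient_descent(a, b, lr=0.015, steps=20)
print(x, y)  # x is essentially at the minimum, y has barely moved
```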

Answer 80

To exploit the fact that directions in the loss surface with high curvature will produce higher gradients, allowing us to reduce the learning rate for that particular weight.

Answer 81

Since we're summing up squared gradients over iterations in the denominator, as gradients are accumulated the learning rate will go to zero (i.e. saturation)

Answer 82

It does not saturate the learning rate (i.e. go to zero like Adagrad does)

Answer 83

How much we care about the past accumulation of the gradients compared with the current gradient.
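The contrast in Answers 81 to 83 can be sketched by tracking the effective learning rate. The cards do not name the second optimizer; this sketch assumes it is RMSProp, with beta playing the role of the decay rate. The constant gradient of 1.0 and all hyperparameter values are made up.

```python
import math


def adagrad_effective_lr(grads, lr=0.1, eps=1e-8):
    acc = 0.0
    for g in grads:
        acc += g * g  # running SUM of squared gradients, never shrinks
    return lr / (math.sqrt(acc) + eps)


def rmsprop_effective_lr(grads, lr=0.1, beta=0.9, eps=1e-8):
    acc = 0.0
    for g in grads:
        # Exponential moving average: old gradients are gradually forgotten.
        acc = beta * acc + (1 - beta) * g * g
    return lr / (math.sqrt(acc) + eps)


grads = [1.0] * 1000
print(adagrad_effective_lr(grads))  # about 0.0032, still shrinking
print(rmsprop_effective_lr(grads))  # about 0.1, levels off
```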

Answer 84

False. It uses both first and second moments.

Answer 85

Tends to be unstable in the beginning as one or both moments will be tiny values.
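Assuming this card is about Adam, the sketch below shows why the moments start out tiny: the moving average is initialized at zero, so the first estimate is only (1 - beta1) times the gradient. The bias-correction term shown is the standard remedy; the function name and single-gradient input are illustrative.

```python
def adam_first_moment(grads, beta1=0.9):
    m, m_hat = 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g  # raw moving average, biased toward 0
        m_hat = m / (1 - beta1 ** t)     # bias-corrected estimate
    return m, m_hat


m, m_hat = adam_first_moment([1.0])
print(m, m_hat)  # raw estimate is about 0.1, corrected estimate is about 1.0
```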

Answer 86

True, but it typically requires more tuning.

Answer 87

1. Graduate students (i.e. manual) 2. Step scheduler 3. Exponential scheduler 4. Cosine scheduler (this one is pretty cool - it actually adjusts it in a cyclical fashion)
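The three automatic schedules can be sketched as plain functions of the epoch. All names and default values (step_size, gamma, total_epochs, min_lr) are assumptions for illustration, and the cosine version covers a single half-cycle; cyclical variants restart it periodically.

```python
import math


def step_lr(base_lr, epoch, step_size=10, gamma=0.1):
    # Drop the rate by a factor of gamma every step_size epochs.
    return base_lr * gamma ** (epoch // step_size)


def exponential_lr(base_lr, epoch, gamma=0.95):
    # Multiply the rate by gamma every epoch.
    return base_lr * gamma ** epoch


def cosine_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    # Follow half a cosine wave from base_lr down to min_lr.
    cos = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + cos)


for epoch in (0, 10, 50, 100):
    print(epoch, step_lr(0.1, epoch), exponential_lr(0.1, epoch),
          cosine_lr(0.1, epoch, total_epochs=100))
```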

Answer 88

It simply penalizes the loss by the sum of the weights multiplied by some small value.

Answer 89

Sparsity in weights (i.e. making many of the values zero)

Answer 90

False. (That said, there actually are some cool papers I've read where dropout is used in inference as an estimator of model uncertainty, e.g. Monte Carlo dropout, etc.)

Answer 91

0.8. Remember that the dropout probability is the probability that we DROP the node, so the probability that we KEEP it is 1 - 0.2 = 0.8

Answer 92

1. It keeps the model from relying too heavily on particular features 2. It is essentially equivalent to training 2^n models (each configuration is technically its own network)

Answer 93

Overfitting

Answer 94

Monitor everything to understand what is going on! Loss, accuracy curves, gradient flows, etc.

Answer 95

[0, +infinity)

Answer 96

Validation loss starts to go up while training loss continues to go down.

Answer 97

Validation loss very close to training loss, or both are high.

Answer 98

If regularization is being used, it's only applied during training. This means the weight penalty is only applied during training and won't be reflected in the validation loss.

Answer 99

1. Learning rate 2. Weight decay

Answer 100

False. These things are highly interdependent!

Answer 101

Generally false. Some papers suggest this combination is actually worse.

Answer 102

False. The learning rate should be changed proportionally to batch size, i.e. increase the learning rate for larger batch sizes. (Larger batches are better estimators, so in theory we can take larger steps)

Answer 103

True. Learning rate should be changed proportionally to batch size.

Answer 104

TP / (TP + FN)

Answer 105

FP / (FP + TN)

Answer 106

(TP + TN) / (TP + TN + FP + FN)

Answer 107

TP / (TP + FP)

Answer 108

TP / (TP + FN)

Answer 109

```
-----------------
|  TP   |  FN   |
|-------|-------|
|  FP   |  TN   |
-----------------
```
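The formulas in Answers 104 to 108 can be computed straight from the four cells of the matrix. The counts below are made up for the example, and the function and key names are hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    return {
        "recall": tp / (tp + fn),               # also the true positive rate
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
    }


metrics = classification_metrics(tp=8, fn=2, fp=4, tn=6)
for name, value in metrics.items():
    print(name, round(value, 3))
```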

Answer 110

Focal loss

(139 cards)