Quiz #2 Flashcards

1
Q

What steps and primary questions comprise the data wrangling process?

A
  1. What is the population of interest?
    1. What sample S are we evaluating?
      1. Is sample S representative of the population?
  2. How do we cross-validate to evaluate our model? How do we avoid overfitting and data mining?
  3. What prediction task (classification vs. regression) do we care about? What are the meaningful evaluation criteria?
  4. How do we create a reproducible pipeline?
2
Q

What are some examples of the definition of a population (in data terms)?

A

All users on Facebook.
All US users on Facebook.
All US users on Facebook in the last month
All the watermelons in the back of the truck.
All the watermelons greater than 5lbs in the back of the truck.
Etc…

3
Q

How do we obtain data from a population?

A

Sampling

4
Q

What are two simple probability-based methods for sampling?

A
  1. Simple Random Sampling
  2. Stratified Random Sampling

5
Q

What is simple random sampling of a population?

A

Every observation from the population has the same chance of being sampled

6
Q

What is stratified random sampling of a population?

A

Population is partitioned into groups and then a simple random sampling approach is applied within each group.

Example: In the watermelons in the back of the truck example, we could partition into 3 groups: (1) less than 5lbs, (2) greater than 5lbs but less than 10lbs, and (3) greater than 10lbs. We could then randomly sample within each group. This is stratified random sampling.
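
A minimal sketch of the watermelon example, assuming made-up weights and a hypothetical helper named stratified_sample (neither comes from the course material): partition by weight group, then draw a simple random sample inside each group.

```python
import random

def stratified_sample(items, stratum_of, per_stratum, seed=0):
    """Partition items into strata, then sample at random within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for item in items:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = {}
    for name, members in strata.items():
        # Never ask for more items than the stratum contains.
        k = min(per_stratum, len(members))
        sample[name] = rng.sample(members, k)
    return sample

def weight_group(weight_lbs):
    """Assign a watermelon to one of the three weight groups."""
    if weight_lbs < 5:
        return "under 5 lbs"
    if weight_lbs < 10:
        return "5 to 10 lbs"
    return "10 lbs and over"

# Hypothetical weights (in lbs) of the watermelons in the truck.
watermelons = [3.2, 4.8, 5.5, 7.1, 9.9, 10.4, 12.0, 6.3, 2.7, 11.5]
picked = stratified_sample(watermelons, weight_group, per_stratum=2)
```

Each group contributes the same number of melons here; sampling proportionally to group size is another common choice.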

7
Q

What are some best practices for data wrangling?

A
  1. Clearly define your population and sample
  2. Understand the representativeness of your sample
  3. Cross-validation can go wrong in many ways; understand the relevant problem and prediction task that will be done in practice
  4. Know the prediction task of interest (regression vs. classification)
  5. Incorporate model checks and evaluate multiple predictive performance metrics
8
Q

What is Cross Validation (CV)?

A

A method for estimating prediction error.
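
One common variant is k-fold cross-validation. The sketch below only shows how the index splits might be generated; the function name k_fold_indices is hypothetical, and model fitting and scoring are left out.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(10, k=5))
```

The prediction error estimate is then the average validation error over the k folds, with each sample held out exactly once.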

9
Q

Grid search is always better than random search when trying to optimize hyperparameters? (True/False)

A

False. A 2012 paper by Bergstra and Bengio found that random search is often just as good as, if not better than, grid search.

10
Q

What are two methods for handling class imbalances?

A
  1. Sampling based, e.g. SMOTE (Synthetic Minority Oversampling Technique), etc.
  2. Cost-based, e.g. Focal Loss for object detection
11
Q

What is one type of plot we can use to gauge the confidence a model has in its prediction?

A

Calibration plot

12
Q

It isn’t necessary to include a datasheet when creating a new dataset? (True/False)

A

False. It can be very helpful to future researchers (including yourself!) to understand how the dataset was constructed.

13
Q

What things should be included in a datasheet for a dataset?

A
  1. Motivation (why is the dataset needed?)
  2. Composition
  3. Collection process
  4. Recommended uses
    …etc.
14
Q

What are the three steps in the Data Cleaning process for ML?

A
  1. Clean
  2. Transform
  3. Preprocess
15
Q

What are three mechanisms that can cause missing data?

A
  1. Missing completely at random.
  2. Missing at random: likelihood of any observation to be missing depends on OBSERVED data features (ex: men are less likely to fill out surveys about depression)
  3. Missing not at random: likelihood of any observation to be missing depends on UNOBSERVED outcome (ex: a person might be less likely to complete a survey if they are depressed)
16
Q

What are some ways we can fix missing data?

A
  1. Remove (easy, but wasteful)
  2. Imputation (mean/median, using learned model to predict, etc.)
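
As a small illustration of the imputation option, here is a sketch of mean imputation on a single column. The column values are made up, and None stands in for a missing entry.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Hypothetical column with two missing entries.
ages = [30.0, None, 50.0, 40.0, None]
filled = impute_mean(ages)
```

Swapping the mean for the median, or for the prediction of a learned model, follows the same pattern.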

17
Q

What are some examples of the data transformation step in the data cleaning process?

A
  1. Converting categorical to index (ordinal numbering, one-hot encoding, etc)
  2. Bag-of-words
  3. TF-IDF
  4. Embeddings
    …etc
18
Q

What are some examples of the data preprocessing step in the data cleaning process?

A

Zero-center data, normalization, etc

19
Q

What are three important components of fairness in ML?

A
  1. Anti-classification: verifying that protected attributes like race, gender, etc. (and their proxies!) are not explicitly used to make decisions.
  2. Classification Parity: common measures of predictive performance are equal across groups defined by protected attributes.
  3. Calibration: conditional on risk estimates, outcomes are independent of protected attributes.
20
Q

What is an example of how a proxy to a protected attribute might result in an unfair ML model?

A

One example might be using features like zip code in areas with high racial segregation. If the model learns that zip code is an important discriminatory feature, there’s a good chance that it has learned a subtle proxy for racial discrimination.

21
Q

Layers in a NN must always be fully connected? (True/False)

A

False. Other connectivity structures are possible, and in many cases (like images) desirable.

22
Q

Why does it make sense to consider small patches of inputs when building a NN for image data? What are these small patches called?

A

They are called receptive fields, modeled after similar structure in the human visual cortex. They make sense to use because while structure exists in image data, it’s often localized, such as edges and lines, and collections of those lines and edges forming higher level motifs.

23
Q

Why does using linear layers not make sense for some applications?

A

Consider the case of image data. If we connect each pixel to every weight in a hidden linear layer, there could be hundreds of millions of parameters to learn for just one layer. Furthermore, patterns in images tend to be SPATIALLY LOCAL. A pixel in the upper right corner in all likelihood will have very little to do with a pixel in the lower left.

24
Q

As the number of parameters to learn in a model increases, more data is needed to ensure a robust model that generalizes to new data? (True/False)

A

True.

25
Q

If we have a receptive field that DOES NOT share weights and is 3x3 pixels connected to 5 output nodes, how many parameters will there be to learn?

A

((K1 x K2) + 1) * N = ((3 x 3) + 1) * 5 = 50

26
Q

For image data, it is necessary to learn location specific features? (True/False)

A

False. There’s no reason to assume that a pattern in an image that occurs in the center might not also be repeated or at some other time appear in some other arbitrary location.

27
Q

What are shared weights in a CNN, and why do we use them?

A

Output nodes in different locations share the same weights across the input space. For example, W11 in the leftmost node would be the same as W11 in the rightmost node. We use shared weights so that we can learn spatial features that are invariant to simple affine transformations, e.g. translation, etc.

28
Q

If we have a receptive field that DOES share weights and is 3x3 pixels connected to 5 output nodes, how many parameters will there be to learn? (assume that this calculation is only considering a single feature extractor)

A

(K1 x K2) + 1 = (3 x 3) + 1 = 10

29
Q

In a CNN, weights are shared across the output nodes as well as across the different feature extractors? (True/False)

A

False. Weights are shared for the SAME feature extractor across the spatial input, but they are NOT shared between DIFFERENT feature extractors, i.e. each feature extractor has its own independent set of weights.

30
Q

If we have a receptive field that DOES share weights and is 3x3 pixels connected to 5 output nodes and there are 4 individual features we want to learn, how many parameters will there be to learn?

A

((K1 x K2) + 1) * M, where M is the number of features we want to learn: ((3 x 3) + 1) * 4 = 40

31
Q

It is extremely important to remember to flip the kernel when implementing cross-correlation for a “convolutional” layer in a NN? (True/False)

A

False. Mathematically, flipping the kernel makes the math work out more elegantly, but since we’re actually learning the kernel values in a CNN and the weights are initialized randomly, the flipping operation is superfluous in practice.

32
Q

How do we implement the convolutional operation in a neural network in practice?

A

At each position, simply take the dot product of the kernel with the input patch it covers (i.e. element-wise multiply and sum).
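
A minimal pure-Python sketch of that operation (cross-correlation, i.e. no kernel flip), restricted to positions where the kernel lies fully inside the image. The function name and the toy image and kernel are illustrative assumptions.

```python
def cross_correlate2d(image, kernel):
    """Slide the kernel over the image; at each position multiply element-wise and sum."""
    h, w = len(image), len(image[0])
    k1, k2 = len(kernel), len(kernel[0])
    out = []
    for r in range(h - k1 + 1):
        row = []
        for c in range(w - k2 + 1):
            total = 0
            for i in range(k1):
                for j in range(k2):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
result = cross_correlate2d(image, kernel)  # [[-4, -4], [-4, -4]]
```

Real frameworks vectorize this heavily, but the arithmetic per output value is the same dot product.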

33
Q

If we implement the forward pass in a convolutional layer as cross-correlation, what operation will the backpropagation be?

A

Convolution. (this is the duality principle that arises in the forward/backward pass of a convolutional layer)

34
Q

Convolution is a complex non-linear operation? (True/False)

A

False. It is a simple linear operation.

35
Q

If we implement the forward pass in a convolutional layer as convolution, what operation will the backpropagation be?

A

Cross-Correlation. (this is the duality principle that arises in the forward/backward pass of a convolutional layer)

36
Q

What is the “valid” form of convolution and what is the output size?

A

It only applies the kernel when it is fully within the image. The output size is: (H - K1 + 1) x (W - K2 + 1). The output dimensions will be SMALLER than the input.

37
Q

What is the “padded” form of convolution and what is the output size?

A

This form of convolution adds padding to the input image (zeros, mirrored values, etc.) and can be used to force the output size to be the same as the input size. In general the output size is: (H - K1 + P1 + 1) x (W - K2 + P2 + 1), where P1 and P2 are the total number of padded rows and columns.

38
Q

If we apply valid convolution on a 5x5 input image using a 3x3 kernel, what will the output size be?

A

(H - K1 + 1) x (W - K2 + 1) = (5 - 3 + 1) x (5 - 3 + 1) = 3x3

39
Q

If we apply padded convolution on a 5x5 image with a 3x3 kernel and a padding size of 1, what will the output size be?

A

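(H - K1 + P1 + 1) x (W - K2 + P2 + 1) = (5 - 3 + 1 + 1) x (5 - 3 + 1 + 1) = 4x4, taking P1 = P2 = 1 as the total padding along each dimension. If one pixel of padding is instead added on every side (P1 = P2 = 2 in total, the convention most frameworks use), the output is 5x5, the same size as the input.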
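
The output-size formula can be checked with a tiny helper. This is a sketch with hypothetical names; pad_total_h and pad_total_w are assumed to mean the total number of padded rows and columns (both sides combined), and the stride is assumed to be 1.

```python
def conv_output_size(h, w, k1, k2, pad_total_h=0, pad_total_w=0):
    """Output height and width for a stride-1 convolution."""
    return (h - k1 + pad_total_h + 1, w - k2 + pad_total_w + 1)

valid = conv_output_size(5, 5, 3, 3)                                  # (3, 3)
padded_total_1 = conv_output_size(5, 5, 3, 3, pad_total_h=1, pad_total_w=1)  # (4, 4)
padded_each_side = conv_output_size(5, 5, 3, 3, pad_total_h=2, pad_total_w=2)  # (5, 5)
```

The last call corresponds to one padded pixel on every side, which is the setting that keeps a 3x3 kernel's output the same size as its input.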

40
Q

Using a stride greater than one is a good way of performing dimensionality reduction on the input? (True/False)

A

Typically false. Using a stride greater than one results in losing information because we’re skipping pixels.

41
Q

For a multi-channel input, what is the shape of the kernel we use?

A

C x K1 x K2, where C is the number of input channels. For example, the kernel for an RGB image would be 3 x K1 x K2.

42
Q

Why do we (in general) not end up simply learning the exact same feature maps when using multiple kernels per layer?

A

Because we initialize the weights randomly, so as gradient descent is applied the weights in each map will tend to converge to different values as a result of their different starting state. However, it is still possible to learn redundant feature maps, so random initialization is more of a heuristic than a guarantee.

43
Q

If we apply convolution using M kernels to learn different features, what will be the number of channels in the output?

A

Since we concatenate the feature maps in the output, we’ll also have M output channels.

44
Q

For an RGB (i.e. 3 channel) input image with N=4 filters and kernel size K1=K2=3, how many parameters will have to be learned for that layer? How many channels would there be in the output feature maps?

A

N * (K1 * K2 * 3 + 1) = 4 * (3 * 3 * 3 + 1) = 112. The number of channels in the output is simply equal to the number of filters, so 4.
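
The parameter counts on this card and on the earlier receptive-field cards can be reproduced with two small helpers. The function names are hypothetical; each count includes one bias per output node or per filter, as in the formulas on the cards.

```python
def unshared_params(k1, k2, n_outputs):
    """Every output node has its own k1 x k2 weights plus a bias."""
    return (k1 * k2 + 1) * n_outputs

def conv_layer_params(k1, k2, n_filters=1, in_channels=1):
    """Shared weights: each filter has in_channels x k1 x k2 weights plus a bias."""
    return n_filters * (k1 * k2 * in_channels + 1)

no_sharing = unshared_params(3, 3, n_outputs=5)                  # 50
one_filter = conv_layer_params(3, 3)                             # 10
four_filters = conv_layer_params(3, 3, n_filters=4)              # 40
rgb_layer = conv_layer_params(3, 3, n_filters=4, in_channels=3)  # 112
```

Note that with shared weights the count no longer depends on the number of output nodes, only on the kernel size, the input channels, and the number of filters.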

45
Q

What are pooling layers used for?

A

Dimensionality reduction.

46
Q

Describe how a max pooling layer works and why it is useful?

A

It is useful for dimensionality reduction of the input. It is performed by striding a window across the image, but instead of applying convolution, the max-operation is applied to every window. This gives us a scalar output from a matrix input. For example, a 3x3 window would have 9 elements, but by applying the max operation, this becomes a single value.
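
A minimal sketch of max pooling on a single-channel feature map, assuming a 2x2 window with stride 2 (the window size, stride, and toy values are illustrative choices, not taken from the card).

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Stride a size x size window over the map and keep the max of each window."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for r in range(0, h - size + 1, stride):
        row = []
        for c in range(0, w - size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        out.append(row)
    return out

feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 8, 3, 4]]
pooled = max_pool2d(feature_map)  # [[6, 4], [8, 9]]
```

The 4x4 input shrinks to 2x2, and no weights are involved anywhere in the computation.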

47
Q

How many parameters must be learned in a max pooling layer?

A

None. The max operation takes no arguments other than the input, so nothing needs to be learned.

48
Q

The only operation that can be performed in a pooling layer is taking the maximum? (True/False)

A

False. Any differentiable function can be used (e.g. average, etc.). In practice though, it’s uncommon to use something other than max pooling.

49
Q

Why is the combination of a convolutional layer with a pooling layer particularly powerful?

A

This combination allows learned features to exhibit some degree of INVARIANCE to simple affine transformations like translation. If the translation of some feature in the image is within the bounds of the pooling layer, it should still be recognized by the feature map.

50
Q

If a feature (such as a bird’s beak) were translated a little bit, the location of the output values from the convolutional layer would remain unchanged? (True/False)

A

False. Convolution has the property of ‘Equivariance’. A translation of the feature results in the output being shifted by the same amount.

51
Q

What are two important properties of convolution?

A
  1. Invariance (features with small transformations/deformations should still activate the output)
  2. Equivariance (no matter where the feature occurs in the image, the feature map will be activated, with the output values moving by the same translation)
52
Q

What are four important design decisions that must be made when developing a DNN?

A
  1. Architecture
  2. Data considerations
  3. Training and Optimization
  4. Machine Learning considerations
53
Q

What are some important architectural considerations we should make when designing a DNN?

A
  1. What modules (layers) should we use?
  2. How should they be connected together?
  3. Can we use domain knowledge to add architectural biases?
54
Q

A FC neural net can accept inputs with dynamic shapes? (True/False)

A

False. This is one of the downsides of a FC NN. Since every input is connected to all the weights, it can only accept inputs of a fixed shape.

55
Q

What are some important optimization considerations we should make when designing a DNN?

A
  1. What optimizer should we use? Different optimizers make different weight updates depending on the gradients.
  2. How should we initialize the weights? If we initialize far away from the minima, can our optimizer actually get us there?
  3. What regularizers should we use? DNNs often have more parameters than data. Regularization is often a must to avoid overfitting.
  4. What loss function is appropriate? Many different options available.
56
Q

What are some important machine learning considerations we should make when designing a DNN?

A
  1. Balancing the trade off between model CAPACITY and AMOUNT OF DATA
  2. Adding appropriate biases based on domain knowledge
57
Q

What is one of the most important principles in designing the architecture of a DNN?

A

The flow of gradients. This is crucial. If the gradient doesn’t flow, the model can’t learn. Gradients become smaller and smaller as they flow back towards the input.

58
Q

The update rate across an entire DNN is always equal everywhere? (True/False)

A

False. The update is a function of the gradient. As we flow further back in the network, the gradients become smaller. Furthermore, when we actually make the update, we’re multiplying the gradient by a learning rate so that the move we make becomes even smaller. This means different parts of the network may experience very different learning experiences.

59
Q

In some cases, it is possible for a layer to stop the gradient from flowing back to the layers below? (True/False)

A

True. This is called a bottleneck.

60
Q

Combination of only linear layers has the same representational power as one linear layer? (True/False)

A

True, because we could simply multiply the layers' weight matrices together to get a single equivalent linear layer.

61
Q

Using only linear layers, we can increase the amount of representational power we have over just a single linear layer? (True/False)

A

False. If we only use linear layers, the output response is simply a linear response of all the inputs and the weights (i.e. we could just multiply them together). It gives us no new representational power (we fix that by using non-linear activations in a NN).
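
A small numeric check of this claim, using made-up weight matrices and no biases: applying two linear layers in sequence gives the same output as one layer whose weight matrix is their product.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    """Multiply matrix m by vector v."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

W1 = [[1, 2], [3, 4], [5, 6]]   # first layer: 2 inputs -> 3 hidden units
W2 = [[1, 0, -1], [2, 1, 0]]    # second layer: 3 hidden units -> 2 outputs
x = [1, -1]

two_layers = matvec(W2, matvec(W1, x))   # apply the layers one after another
one_layer = matvec(matmul(W2, W1), x)    # apply the single collapsed layer
```

Both paths produce [0, -3]; inserting a non-linearity between the two layers is what breaks this equivalence.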

62
Q

What allows us to obtain complex transformations of data in a DNN?

A

The composition of multiple non-linear layers.

63
Q

The gradient flow across linear layers and non-linear layers is, in both cases, highly sensitive to the shape of the function they are modeling? (True/False)

A

False. The gradient flow across linear layers is straightforward. Gradient flow across non-linear layers is not, and is strongly impacted by its shape.

64
Q

What are some aspects of non-linear activation functions that we can analyze?

A
  1. Min/Max
  2. Correspondence between input and output statistics
  3. Gradients:
    1. At initialization (small values, etc)
    2. At extremes
  4. Computational complexity
65
Q

What are the minimum and maximum output values of the sigmoid function?

A

0 and 1, respectively.

66
Q

The sigmoid function does NOT saturate at both ends? (True/False)

A

False, it does saturate. It approaches 1 as x –> +infinity, and approaches 0 as x –> -infinity

67
Q

The gradient of the sigmoid function vanishes at both ends and is also always positive? (True/False)

A

True. The min/max value of sigmoid is 0/1, which means the function is saturated at both ends which results in vanishing gradients.
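
A quick numeric sketch of both properties, using the standard identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)); the sample points are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), where s = sigmoid(x)."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at 0.25 for x = 0 and shrinks towards 0 at both extremes.
grads = {x: sigmoid_grad(x) for x in (-10, 0, 10)}
```

At x = 10 or x = -10 the gradient is already below 1e-4, while it stays strictly positive everywhere.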

68
Q

Why are vanishing gradients a problem?

A

Because the flow of gradients is how we make updates to a DNN. Since we backpropagate gradients, if an upstream gradient is small to begin with, it will become “vanishingly” small as it flows back.

69
Q

The sigmoid function is good from a computational efficiency standpoint? (True/False)

A

False. It has the exponential term in it.

70
Q

What is the min/max value of the hyperbolic tangent (tanh) activation function?

A

Min: -1, Max: 1

71
Q

The tanh activation function is capable of flipping the sign of input features? (True/False)

A

True. This is because its range is -1 to 1

72
Q

The tanh activation function is saturated at both ends? (True/False)

A

True. Its min/max values are -1 and 1

73
Q

The tanh activation function is centered at the origin? (True/False)

A

True

74
Q

The tanh activation function does not suffer from the vanishing/exploding gradient problem? (True/False)

A

False. Just like sigmoid, its gradient vanishes at both ends and is always positive.

75
Q

What can cause gradients to explode?

A

Activation functions whose gradient is always positive.

76
Q

The ReLU activation function is centered at the origin and symmetric? (True/False)

A

False. Its min output is 0, max output +infinity

77
Q

The ReLU activation function can output negative values? (True/False)

A

False. Its range is from 0 to +infinity.

78
Q

Why is gradient flow more effective with ReLU than many other activation functions?

A

Because ReLU does not saturate on the positive end: the output is unbounded (it goes to +infinity) and the gradient stays constant instead of shrinking towards zero.

79
Q

ReLU activation neurons are impervious to negative input values? (True/False)

A

False. ReLU takes on a value of zero for any negative inputs, which means a ReLU neuron can effectively “die”, if its input is always negative.

80
Q

The gradient of ReLU is constant (assuming positive input)? (True/False)

A

True.

81
Q

ReLU is a computationally expensive function to compute? (True/False)

A

False. It is simply the max function, which is easy to compute.

82
Q

What is an example of an activation function that tries to address the “Dying ReLU” problem?

A

Leaky ReLU is one popular alternative. It has a small, positive slope for all input values < 0, so its range is from -infinity to +infinity.
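
A side-by-side sketch of the two activations and their gradients. The negative-side slope of 0.01 is a commonly used default and is an assumption here, as is the choice to use the negative-side gradient at exactly x = 0.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 otherwise (a "dead" region).
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def leaky_relu_grad(x, slope=0.01):
    # A small non-zero gradient still flows for negative inputs.
    return 1.0 if x > 0 else slope

negative_input = -3.0
dead = relu_grad(negative_input)          # 0.0
alive = leaky_relu_grad(negative_input)   # 0.01
```

Because the Leaky ReLU gradient is never exactly zero, a neuron whose input is always negative can still receive updates.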

83
Q

The Leaky ReLU activation function still suffers from the “dying neuron” problem? (True/False)

A

False. The Leaky ReLU function does not saturate for any input value, whereas normal ReLU does saturate for inputs <=0

84
Q

Leaky ReLU is much more computationally complex to compute than regular ReLU? (True/False)

A

False.

85
Q

The ReLU activation function is differentiable? (True/False)

A

Strictly false (it is not differentiable at x = 0), but in practice it can be treated as piecewise linear, and the derivative for x <= 0 and x > 0 calculated separately.

86
Q

ReLU is always the best activation function to choose? (True/False)

A

False. ReLU is the most common choice, but the best choice is a function of your problem at hand. No “one-size-fits-all” solution.

87
Q

When might the sigmoid activation function be one you would want to choose?

A

If you needed to clamp the outputs to the 0 to 1 range would be one case. In general sigmoid is not the best choice though.

88
Q

Initialization of the model parameters isn’t generally very important in most situations? (True/False)

A

False. It is extremely important. Imagine a case where you initialize the weights to all be within the saturation range of your activation functions. Your model would never learn anything.

89
Q

Initialization of the parameters plays a big role in how gradients flow at the end of training? (True/False)

A

False. Naturally, it has a big impact on the BEGINNING of training.

90
Q

It is possible to use initialization of the parameters as a form of pseudo-regularization to limit the full capacity of the model? (True/False)

A

True. You could initialize parameters so that the inputs to the activation functions fall within the linear or nonlinear range to obtain different behavior and limit capacity.

91
Q

The initial starting values for the model parameters have little impact on how fast training converges to a good local minimum? (True/False)

A

False.

92
Q

What happens if weights are initialized to a constant value?

A

This is a degenerate solution. If all the weights are the same, the gradients will be the same too, so the model can never learn!

93
Q

Why is it generally preferred to initialize weights to a small, normally distributed random number?

A

It prevents the model from starting in a biased state. Unless we have some good reason to think one feature over another is more important, we should treat the probability of any particular hypothesis in the weight space being true as uniform.

94
Q

Deeper networks are more sensitive to initialization of parameters? (True/False)

A

True. This is because in a deep net the activations get smaller as you go deeper in. This leads to small updates.

95
Q

Ideally, we would like for the variance at the output to be different to that of the input in a NN?

A

False. We want them to be the same.

96
Q

What is one method of parameter initialization that helps maintain the variance at the output to be the same as that of the input?

A

Xavier/Xavier2 initialization
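
For reference, a sketch of the uniform form of Xavier (Glorot) initialization, which draws weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)), giving a variance of 2 / (fan_in + fan_out). The card does not say which variant it has in mind, so the uniform form, the function name, and the layer sizes are assumptions.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, seed=0):
    """Return a fan_out x fan_in weight matrix drawn from U(-limit, limit)."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

weights = xavier_uniform(fan_in=256, fan_out=128)

# Empirical variance of the drawn weights, for comparison with 2 / (fan_in + fan_out).
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
```

Scaling the weight range by the layer's fan-in and fan-out is what keeps activation and gradient variances roughly stable from layer to layer.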

97
Q

What are two reasons why parameter initialization matters?

A
  1. Determines the activation statistics, and consequently, the gradient statistics
  2. Can impact vanishing/exploding gradient problems
98
Q

Normalization generally doesn’t have an impact on gradient flow? (True/False)

A

False

99
Q

How many learnable parameters are there in a vanilla batch normalization layer? (Assume scaling and offset are not being learned)

A

Zero. A batch normalization layer simply calculates the mean and variance of a mini-batch and then normalizes the output. The goal of this is to improve gradient flow.

If we want the model to learn to decide whether to normalize or not, we can add learnable parameters for the scale and shift.
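
A minimal sketch of the normalization for a single feature over one mini-batch. The names gamma and beta for the optional scale and shift, and the small eps added for numerical stability, follow common usage and are assumptions here; with gamma = 1 and beta = 0 nothing is learned.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across the mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)

# The normalized values have (approximately) zero mean and unit variance.
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)
```

If gamma and beta are made learnable, the model can undo the normalization by learning gamma close to the batch standard deviation and beta close to the batch mean.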

100
Q

What does including learnable scale and offset parameters in a batch normalization allow the model to do?

A

To learn for itself whether to normalize or not

101
Q

Why is it important to use sufficiently large batch sizes if using batch normalization layers in your network?

A

Since these layers need to compute mean/stdev using each mini-batch, we need to ensure that the batch size is large enough that we get good estimates of these parameters for each batch.

102
Q

In general, it is not a good idea to perform normalization prior to a non-linearity? (True/False)

A

False. We don’t want very low/high values going into an activation layer as this will cause saturation and prevent gradients from flowing.

103
Q

The existence of local minima is the main issue in optimization?

A

False. This used to be the case, but more analysis has tended to suggest that the problem lies elsewhere, such as:

  • Noisy gradient estimates (i.e. computing over mini-batches)
  • Saddle points
  • Ill-conditioned loss surfaces
104
Q

Using mini-batches of data to calculate the loss and gradients always results in an unbiased estimator with low variance? (True/False)

A

False. It IS an unbiased estimator, but the batch sizes we’re using might be an incredibly small percentage of the total training data, so our estimates will be very noisy. This results in noisy, jerky gradient descent steps.

105
Q

What is momentum used for in optimization?

A

It’s analogous to Newton’s first law: “An object in motion tends to remain in motion…”. It helps overcome flat regions or saddle points in the loss surface. If we’ve just gone down a steep “hill”, then we’ll keep going in that direction. Rather than update the weights using the gradient itself, we use the “velocity” of the gradient to make the update.
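
One common way to write the momentum update is sketched below, applied to the toy loss f(w) = w^2. The exact formulation (velocity = beta * velocity - lr * grad), the hyperparameter values, and the toy problem are all assumptions for illustration.

```python
def sgd_momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
    """Update the velocity from the gradient, then move the weight by the velocity."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

# Minimize f(w) = w^2, whose gradient is 2w, starting from w = 5.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=2.0 * w)
```

Because each new velocity keeps a fraction beta of the old one, older gradients are discounted geometrically, which is why recent steps dominate the direction of travel.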

106
Q

When using momentum to accelerate gradient descent, gradient updates from further back are more heavily discounted? (True/False)

A

True. We want to use what’s been happening in the local region to inform the direction of our momentum.

107
Q

What is the ‘condition number’ in the context of optimization?

A

It tells us how different the curvature is along different dimensions. Think about the surface of an umbrella vs. the bottom of a canoe. If condition number is HIGH, as in the case of the canoe, a small step in the direction of the beam of the boat results in a big change in the other direction.

Mathematically, it is the ratio of the largest to the smallest eigenvalue of the Hessian.

108
Q

SGD will always take the same size steps across all dimensions?

A

False. If the ‘condition number’ is high, then SGD will make big steps in some dimensions and small steps in the other.

109
Q

What is the motivation behind the use of adaptive, per-weight learning rate algorithms?

A

To exploit the fact that directions in the loss surface with high curvature will produce higher gradients, allowing us to reduce the learning rate for that particular weight.

110
Q

What is the principal problem with Adagrad optimization?

A

Since we’re summing up gradients over iterations in the denominator, as gradients are accumulated the learning rate will go to zero (i.e. saturation)

111
Q

What is one advantage of using RMSProp as an optimizer?

A

It does not saturate the learning rate (i.e. go to zero like Adagrad does)

112
Q

What does the beta hyperparameter in the RMSProp optimizer control?

A

How much we care about the past accumulation of the gradients compared with the current gradient.
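
The difference between the two accumulators can be seen in a short sketch that feeds the same constant gradient to both. Function names, the learning rate, and beta = 0.9 are illustrative assumptions; only the effective per-weight step size is computed, not a full optimizer.

```python
import math

def adagrad_step_size(grads, lr=0.1, eps=1e-8):
    """Adagrad sums squared gradients, so the accumulator only ever grows."""
    acc = 0.0
    for g in grads:
        acc += g * g
    return lr / (math.sqrt(acc) + eps)

def rmsprop_step_size(grads, lr=0.1, beta=0.9, eps=1e-8):
    """RMSProp keeps an exponentially decaying average of squared gradients."""
    acc = 0.0
    for g in grads:
        acc = beta * acc + (1 - beta) * g * g
    return lr / (math.sqrt(acc) + eps)

constant_grads = [1.0] * 10000
adagrad_lr = adagrad_step_size(constant_grads)   # shrinks towards zero
rmsprop_lr = rmsprop_step_size(constant_grads)   # settles near lr
```

A larger beta weights the past accumulation more heavily; a smaller beta makes the step size track the current gradient more closely.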

113
Q

The Adam optimizer only uses first moment statistics for gradients? (True/False)

A

False. It uses both first and second moment statistics.

114
Q

What is one of the main challenges in using Adam as an optimizer?

A

Tends to be unstable in the beginning as one or both moments will be tiny values.

115
Q

In general, plain SGD + Momentum generalizes better than adaptive methods? (True/False)

A

True, but it typically requires more tuning.

116
Q

What are some common learning rate schedulers?

A
  1. Graduate students (i.e. manual)
  2. Step scheduler
  3. Exponential scheduler
  4. Cosine scheduler (this one is pretty cool - it actually adjusts it in a cyclical fashion)
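
The three automatic schedules can be sketched as plain functions of the epoch. The decay factors and step size below are arbitrary illustrative values, and the cosine version shows a single decay cycle (restarting it periodically gives the cyclical behavior).

```python
import math

def step_lr(base_lr, epoch, step_size=10, gamma=0.1):
    """Multiply the learning rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

def exponential_lr(base_lr, epoch, gamma=0.95):
    """Multiply the learning rate by gamma every epoch."""
    return base_lr * gamma ** epoch

def cosine_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    """Follow half a cosine wave from base_lr down to min_lr."""
    cosine = 1 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cosine

start = cosine_lr(0.1, epoch=0, total_epochs=100)
middle = cosine_lr(0.1, epoch=50, total_epochs=100)
end = cosine_lr(0.1, epoch=100, total_epochs=100)
```

The cosine schedule starts at the base rate, passes through half of it at the midpoint, and ends at the minimum rate.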
117
Q

How does the L1 norm work for regularization?

A

It simply penalizes the loss by the sum of the absolute values of the weights multiplied by some small value.
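
As a sketch, the L1 penalty is a small coefficient times the sum of the absolute values of the weights, added to the data loss. The coefficient name lam, its value, and the toy numbers are assumptions.

```python
def l1_penalty(weights, lam=0.01):
    """L1 regularization term: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)

data_loss = 0.5
weights = [0.5, -1.5, 0.0, 2.0]
total_loss = data_loss + l1_penalty(weights)  # 0.5 + 0.01 * 4.0
```

Taking absolute values matters: without them, positive and negative weights would cancel and large weights could go unpenalized.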

118
Q

What behavior does L1 regularization tend to encourage in models?

A

Sparsity in weights (i.e. making many of the values zero)

119
Q

Dropout is applied during both the training and test phases? (True/False)

A

False**

**There actually are some cool papers I’ve read where dropout is used in inference as an estimator of model uncertainty, e.g. Monte Carlo dropout, etc.

120
Q

If we set the dropout probability to 0.2, what is the chance that a given node will be dropped at each iteration?

A

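0.8. Remember that here the dropout probability is the probability that we KEEP the node, so the chance of dropping it is 1 - 0.2 = 0.8. (Some libraries, e.g. PyTorch's nn.Dropout, define the parameter the other way around, as the probability of dropping.)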
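
A sketch of dropout at training time under the keep-probability convention used on this card. Rescaling the kept activations by 1 / keep_prob ("inverted dropout") is a common implementation choice that the card does not mention, so treat it as an assumption.

```python
import random

def dropout(activations, keep_prob, rng):
    """Keep each activation with probability keep_prob; zero it otherwise."""
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            # Rescale so the expected activation is unchanged (inverted dropout).
            out.append(a / keep_prob)
        else:
            out.append(0.0)
    return out

rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, keep_prob=0.2, rng=rng)
kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
```

With keep_prob = 0.2, roughly 20% of the nodes survive each pass and about 80% are dropped.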

121
Q

What are some interpretations of why dropout works?

A
  1. It keeps the model from relying too heavily on particular features
  2. It is essentially equivalent to training 2^n models (each configuration is technically its own network)
122
Q

What can data augmentation help prevent?

A

Overfitting

123
Q

What is the key principle in the process of training a deep learning model?

A

Monitor everything to understand what is going on! Loss, accuracy curves, gradient flows, etc.

124
Q

What are the bounds of the cross-entropy loss function?

A

[0, +infinity)

125
Q

What is the classic case of overfitting?

A

Validation loss starts to go up while training loss continues to go down.

126
Q

What is the classic case of underfitting?

A

Validation loss very close to training loss, or both are high.

127
Q

Why will training loss often appear larger than validation loss, even on a good model?

A

If regularization is being used, it’s only applied during training. This means the weight penalty is only applied during training and won’t be reflected in the validation loss.

128
Q

What two hyperparameters are deep learning models particularly sensitive to, and that we should always tune?

A
  1. Learning rate
  2. Weight decay

129
Q

Hyperparameters and module selection can generally be chosen independently? (True/False)

A

False. These things are highly interdependent!

130
Q

It’s always a good idea to use batch normalization and dropout together? (True/False)

A

Generally false. Some papers suggest this combination is actually worse.

131
Q

The learning rate is independent of batch size? (True/False)

A

False. The learning rate should be changed proportionally to batch size, i.e. increase the learning rate for larger batch sizes. (Larger batches are better estimators, so in theory we can take larger steps)

132
Q

If we increase the batch size, we should also increase the learning rate? (True/False)

A

True. Learning rate should be changed proportionally to batch size.

133
Q

How is the True Positive Rate (TPR) calculated?

A

TP / (TP + FN)

134
Q

How is the False Positive Rate (FPR) calculated?

A

FP / (FP + TN)

135
Q

How is accuracy calculated?

A

(TP + TN) / (TP + TN + FP + FN)

136
Q

How is precision calculated?

A

TP / (TP + FP)

137
Q

How is recall calculated?

A

TP / (TP + FN)

138
Q

What are the components of a confusion matrix and how are they arranged?

A
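---------------
|  TP  |  FN  |
|------|------|
|  FP  |  TN  |
---------------
Rows are the actual class (positive, then negative); columns are the predicted class (positive, then negative).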
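
The metric formulas on the preceding cards can be computed directly from the four cells of the confusion matrix. The function name and the counts below are made up for illustration.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the standard rates from the four confusion-matrix counts."""
    return {
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts: 50 actual positives and 50 actual negatives.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
```

Recall and the true positive rate are the same quantity, which is why they share a formula.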
139
Q

What is an example of a loss function that can be used with imbalanced classes?

A

Focal loss