Lecture 6 - DNNs Flashcards

1
Q

How do shallow networks differ from deep networks?

A

Shallow networks have fewer hidden layers (1 or 2), while deep networks have many hidden layers (e.g., 5 or more).

2
Q

Why are deep neural networks considered a separate research area?

A

Deep networks come with additional complexities in learning. Traditional gradient descent doesn’t work well for them, requiring specialized techniques.

3
Q

Why are deep networks necessary for tasks like image recognition and language processing?

A

The additional layers and nodes in deep networks enable the creation of more complex features, which are essential for these tasks.

4
Q

Why use deep neural networks instead of shallow networks that are already universal approximators?

A

Deep neural networks can compute many functions with far fewer hidden units, whereas shallow networks require exponentially many hidden units to achieve the same level of approximation.

5
Q

How do deep and shallow networks compare in terms of required resources for complex problems?

A
  1. Deep networks require only a logarithmic number of neurons, O(log n), for certain problems.
  2. Shallow networks require exponentially many neurons, O(2^n), for the same problems.
  • Deep networks exhibit logarithmic growth, whereas shallow networks exhibit exponential growth in the number of required neurons.
6
Q

What is an example of a problem where deep networks outperform shallow networks in terms of efficiency?

A

For problems like x_1 XOR x_2 XOR x_3 XOR x_4 (the parity function), deep networks can solve the problem with far fewer neurons than shallow networks (see the sketch below).
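As an illustration of the depth advantage (my sketch, not from the lecture): n-bit parity can be computed by a balanced tree of pairwise XORs with O(log n) layers, whereas a single hidden layer essentially has to enumerate exponentially many input patterns.

```python
def parity(bits):
    """n-bit parity via a balanced XOR tree; assumes len(bits) is a power of two.
    Each pass halves the number of values, so the 'depth' is O(log n)."""
    layer = list(bits)
    while len(layer) > 1:
        layer = [layer[i] ^ layer[i + 1] for i in range(0, len(layer), 2)]
    return layer[0]

print(parity([1, 0, 1, 1]))  # x1 XOR x2 XOR x3 XOR x4 -> 1
```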

7
Q

What is forward propagation in DNNs?

A

Forward propagation is the process of passing input data through the layers of a neural network to compute the final output.

8
Q

What are the two main operations performed by each layer in a neural network during forward propagation?

A
  1. Linear transformation: inputs from the previous layer are multiplied by a weight matrix and added to a bias vector, resulting in the pre-activation values.
  2. Non-linear activation: the pre-activation values are passed through an activation function to produce the layer's output activations (see the sketch below).
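A minimal NumPy sketch of these two operations for one layer (variable names are illustrative, not from the lecture slides):

```python
import numpy as np

def layer_forward(a_prev, W, b, g=np.tanh):
    """One layer of forward propagation.

    a_prev: activations from the previous layer, shape (n_prev, m)
    W:      weight matrix, shape (n, n_prev)
    b:      bias vector, shape (n, 1)
    g:      non-linear activation function
    """
    z = W @ a_prev + b   # 1. linear transformation (pre-activation values)
    a = g(z)             # 2. non-linear activation
    return z, a          # z is cached for backward propagation
```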
9
Q

Why is a non-linear activation function important in forward propagation?

A

The non-linear activation function introduces non-linearity, enabling the neural network to learn and model complex patterns.

10
Q

What is backward propagation in DNNs?

A

Backward propagation is the process used to calculate the gradient of the loss function with respect to the weights and biases in the network. The gradients are used to update the parameters via optimization (e.g., gradient descent).

11
Q

What is the starting point for backward propagation?

A

The starting point is the derivative of the loss with respect to the output of the final layer L, i.e., da^[L].

12
Q

What are the key steps in computing the gradient flow for each layer during backward propagation?

A
  1. compute the gradient of the pre-activation values (dz^[l])
  2. compute the gradient of the weights (dW^[l])
  3. compute the gradient of the biases (db^[l])
  4. backpropagate the error to the previous layer (da^[l-1])
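In code, these four steps for a single layer might look like the following sketch (g_prime is the derivative of the layer's activation function; names are illustrative):

```python
import numpy as np

def layer_backward(da, z, a_prev, W, g_prime):
    """Gradient flow through one layer; m is the batch size."""
    m = a_prev.shape[1]
    dz = da * g_prime(z)                         # 1. gradient of pre-activations
    dW = (dz @ a_prev.T) / m                     # 2. gradient of the weights
    db = np.sum(dz, axis=1, keepdims=True) / m   # 3. gradient of the biases
    da_prev = W.T @ dz                           # 4. error for the previous layer
    return dW, db, da_prev
```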
13
Q

What are hyperparameters in a neural network?

A

Hyperparameters are parameters set before training a neural network, influencing the learning process and performance of the model.

14
Q

How does the number of layers affect a neural network?

A
  • The number of layers refers to the depth of the network.
  • More layers allow the network to learn complex hierarchical features but increase computational complexity and the risk of overfitting.
15
Q

What does the number of units per layer determine in a neural network?

A

The number of units per layer determines the width of the network. More units increase the model’s capacity but may also lead to overfitting.

16
Q

Why is learning rate an important hyperparameter in gradient descent?

A
  • The learning rate controls the step size during gradient descent.
  • A high learning rate might overshoot the minimum, while a low learning rate can make training slow.
17
Q

What role do activation functions play in neural networks?

A
  • Activation functions define the transformation applied to the input at each layer.
  • Common choices include ReLU, Sigmoid, and Tanh.
18
Q

Important hyperparameters

A
  1. layers
  2. units
  3. learning rate
  4. activation functions
  5. batch size
  6. weight initialization strategies
  7. dropout rate
  8. optimizers
19
Q

How should datasets be divided to evaluate and optimize hyperparameters?

A
  1. Training set: Used to train the model.
  2. Development set: Used to tune the hyperparameters.
  3. Test set: Used to evaluate the performance of the algorithm.
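A quick sketch of such a three-way split (the 60/20/20 ratio here is my choice, not from the lecture):

```python
import numpy as np

def train_dev_test_split(X, y, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                 # shuffle example indices
    n_train = int(0.6 * len(X))                   # 60% train
    n_dev = int(0.2 * len(X))                     # 20% dev, 20% test
    tr, dev, te = np.split(idx, [n_train, n_train + n_dev])
    return (X[tr], y[tr]), (X[dev], y[dev]), (X[te], y[te])
```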
20
Q

What are bias and variance in the context of neural networks?

A
  1. Bias: Error due to overly simplistic assumptions in the model. High bias indicates underfitting.
  2. Variance: Error due to high sensitivity to small fluctuations in the training data. High variance indicates overfitting.
21
Q

Why should you address bias before variance in a model?

A
  • High bias means the model poorly approximates the data regardless of variance, so fix the bias first.
  • Only then decrease the variance.
  • Since decreasing variance can increase bias, it is crucial to check the bias again afterwards.
22
Q

What are the characteristics of a model with high variance?

A
  • A high-variance model performs well on the training set but poorly on the validation/dev set.
  • Example: Train set error = 1%, Dev set error = 11%.
23
Q

How can you reduce high variance in a neural network?

A
  1. Increase the size of the training data.
  2. Apply regularization.
  3. Adjust the network architecture (e.g., use fewer layers or units).
24
Q

What are the characteristics of a model with high bias?

A
  • A high-bias model performs poorly on both the training and validation/dev sets.
  • Example: Train set error = 15%, Dev set error = 16%.
25
Q

How can you reduce high bias in a neural network?

A
  1. Use a larger network.
  2. Train for a longer duration.
  3. Adjust the architecture (e.g., add more layers or units).
26
Q

What are the characteristics of a model with both high bias and high variance?

A
  • The model has a large error on the training set and an even larger error on the dev set, indicating that it neither fits the training data well nor generalizes.
  • Example: Train set error = 15%, Dev set error = 30%.
27
Q

How can you address high bias and high variance?

A

Address the bias first (by increasing model capacity or training longer) and then reduce variance (by applying regularization or using more data).

28
Q

What are the characteristics of a model with low bias and low variance?

A
  • This is the ideal case where both training and dev errors are low, with the dev set error slightly higher than the training set error.
  • Example: Train set error = 0.5%, Dev set error = 1%.
29
Q

high variance plot

A

circles exactly around all the targets

30
Q

high bias plot

A

rigid straight/slanted line that captures most of the points

31
Q

high bias high variance plot

A

a straight line that makes an exception for an outlying point

32
Q

low bias low variance plot

A

circles around most of the target points but leaves the outlying point out.

33
Q

What is dropout regularization in neural networks?

A

Dropout is a regularization technique used to prevent overfitting by randomly dropping out a proportion of neurons during the training process. The dropped-out neurons do not participate in forward and backward propagation.
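A minimal sketch of the common "inverted dropout" formulation, where surviving activations are rescaled so their expected value matches test time (keep_prob and the function name are illustrative):

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8, training=True):
    if not training:
        return a                                  # no dropout at test time
    mask = np.random.rand(*a.shape) < keep_prob   # drop ~(1 - keep_prob) of units
    return a * mask / keep_prob                   # rescale so E[output] is unchanged
```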

34
Q

How does dropout regularization help in preventing overfitting?

A

By randomly turning off neurons, dropout forces the network to distribute the learned weights across different neurons, preventing any single neuron from becoming too dominant and improving the model’s generalization ability.

35
Q

Why is it impractical to use a separate regularization term for each layer in deeper neural networks?

A

As the number of layers increases, tuning a separate regularization term for every layer becomes impractical because it adds too many hyperparameters to tune. Dropout addresses this issue without adding extra per-layer hyperparameters.

36
Q

What do L2 regularization and dropout target?

A
  1. L2 Regularization: Prevents overly large weights.
  2. Dropout: Prevents reliance on specific neurons.
37
Q

What is early stopping in neural network training?

A
  • An alternative to regularization.
  • Early stopping halts the training process at the point where the validation error is smallest, preventing overfitting by not allowing the model to train further once it starts overfitting (see the sketch below).
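A minimal sketch of the idea, assuming hypothetical train_step() and dev_error() callables:

```python
def train_with_early_stopping(train_step, dev_error, max_epochs=100, patience=5):
    """Stop once the dev error has not improved for `patience` epochs."""
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()                              # one epoch of training
        err = dev_error()
        if err < best_err:
            best_err, best_epoch = err, epoch     # checkpointing would go here
        elif epoch - best_epoch >= patience:
            break                                 # dev error stopped improving
    return best_err
```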
38
Q

How does data augmentation help in reducing overfitting?

A

Data augmentation artificially increases the size and diversity of the training dataset by applying transformations (e.g., rotations, translations) to the data, making the model more robust to variations.
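A toy sketch of two such transformations on a 2-D image array (purely illustrative; real pipelines use library transforms):

```python
import numpy as np

def augment(image, rng):
    """Randomly mirror the image and shift it by up to 2 pixels."""
    if rng.random() < 0.5:
        image = image[:, ::-1]           # horizontal flip
    shift = int(rng.integers(-2, 3))     # small translation
    return np.roll(image, shift, axis=1)
```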

39
Q

What is the key benefit of data augmentation?

A

It helps models learn to ignore irrelevant variations, such as minor shifts or distortions, and focus on the important features that define the object or class.

40
Q

Why is input normalization important for optimization using gradient descent?

A

Without normalization, input features with varying scales can lead to elongated cost function contours, making gradient descent inefficient as it must zig-zag through narrow valleys to converge.

41
Q

How does normalizing the inputs affect the cost function contours?

A

Normalizing inputs (scaling to have a mean of 0 and variance of 1) results in circular contours, allowing gradient descent to converge faster by following a more direct optimization path.
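A minimal sketch; note that the dev/test data must be scaled with the training set's statistics:

```python
import numpy as np

def normalize_inputs(X_train, X_test):
    """Scale features to mean 0, variance 1 using training statistics only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8   # avoid division by zero
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```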

42
Q

What are batch normalized layers, and how do they help?

A

Batch normalization layers normalize the activations partway through forward propagation before propagating forward again. This keeps deeper layers learnable and improves training stability and speed (see the sketch below).
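A minimal forward-pass sketch of the standard batch-normalization formulation (gamma and beta are learned scale and shift parameters; the running statistics used at test time are omitted):

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=1, keepdims=True)        # per-unit mean over the batch
    var = z.var(axis=1, keepdims=True)        # per-unit variance over the batch
    z_norm = (z - mu) / np.sqrt(var + eps)    # normalize the activations
    return gamma * z_norm + beta              # learned rescale and shift
```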

43
Q

What is the key takeaway about normalizing inputs for deep learning models?

A

Always normalize input features to improve training stability and speed in deep learning models.

44
Q

What are vanishing and exploding gradients, and why are they problematic in deep learning?

A
  • In very deep networks, the output can be expressed as a product of many weight matrices.
  • If the largest eigenvalue of W is greater than 1, gradients explode.
  • If the largest eigenvalue of W is less than 1, gradients vanish.
  • Both scenarios cause instability during training.
45
Q

What happens during vanishing gradients, and what is the effect on training?

A
  • Gradients become very small (approach 0) as they backpropagate through deep layers, especially with activation functions like Sigmoid or Tanh.
  • This causes the weights in earlier layers to update very slowly, leading to poor learning.
46
Q

How can vanishing gradients be mitigated?

A

One solution is a skip connection, which passes a layer's output directly to a later layer (bypassing the layer in between), helping maintain gradient flow during backpropagation.

47
Q

What happens during exploding gradients, and what is the effect on training?

A
  • Gradients grow excessively large in deeper layers, causing instability in learning.
  • This leads to large oscillations in gradient descent or numerical overflow.
48
Q

How can exploding gradients be mitigated?

A

One solution is gradient clipping, which limits the maximum value of the gradient to prevent large jumps in gradient descent.
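A sketch of one common variant, clipping by global norm (the threshold is illustrative):

```python
import numpy as np

def clip_by_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays if their global norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads
```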

49
Q

What is mini-batch gradient descent?

A
  • Mini-batch gradient descent is an optimization algorithm that divides the training dataset into smaller subsets called batches.
  • The algorithm computes gradients based on a batch rather than the entire dataset or a single sample.
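A skeleton of the resulting training loop, with compute_grads standing in for forward and backward propagation (all names are placeholders):

```python
import numpy as np

def minibatch_gd(params, X, y, compute_grads, lr=0.01, batch_size=64, epochs=10):
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)            # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            grads = compute_grads(params, X[batch], y[batch])
            for p, g in zip(params, grads):
                p -= lr * g                       # update from this batch only
    return params
```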
50
Q

What is the advantage of using mini-batch gradient descent over batch gradient descent?

A

Mini-batch gradient descent allows faster computation by leveraging vectorization and reduces memory requirements compared to batch gradient descent.

51
Q

How does forward propagation work in mini-batch gradient descent?

A

In mini-batch gradient descent, forward propagation is performed on a set of data (a mini-batch), the error is measured, and the derivative of the error is used in backpropagation.

52
Q

How does the cost decrease differently in batch gradient descent versus mini-batch gradient descent?

A
  1. Batch Gradient Descent: The cost decreases steadily with iterations.
  2. Mini-batch Gradient Descent: The cost converges to the same place as batch gradient descent but oscillates around it.
53
Q

What is batch gradient descent?

A

Batch gradient descent uses the entire training dataset to compute the gradient for each iteration, resulting in smooth convergence but being computationally expensive for large datasets.

54
Q

How is batch gradient descent represented in visualizations?

A

It is represented in blue, showing a smooth path toward the optimum.

55
Q

What is stochastic gradient descent (SGD)?

A

SGD updates the weights after computing the gradient for a single data point, resulting in much faster updates but noisier and more erratic convergence, with a higher risk of overshooting the minimum.

56
Q

How is stochastic gradient descent represented in visualizations?

A

It is represented in purple, showing a zig-zag, erratic path.

57
Q

How does mini-batch gradient descent combine the benefits of batch and stochastic gradient descent?

A

Mini-batch gradient descent combines the faster updates of SGD with the smoother trajectory of batch gradient descent, circling around the optimal value.

58
Q

How is mini-batch gradient descent represented in visualizations?

A

It is represented in green, showing a less chaotic path than SGD.

59
Q

What is gradient descent with momentum, and why is it used?

A
  • Gradient descent with momentum adds a “memory” to the updates by incorporating the previous update direction.
  • This helps dampen oscillations in directions where updates repeatedly reverse (e.g., ellipsoids), allowing the optimization to approach the minimum more smoothly.
60
Q

How are weights and biases updated in gradient descent with momentum?

A
  • W = W − α v_dw
  • b = b − α v_db
  • where α is the learning rate.

61
Q

How are the velocity terms v_dw and v_db computed in gradient descent with momentum?

A
  • v_dw = β v_dw + (1−β) dW
  • v_db = β v_db + (1−β) db
  • where β is the damping factor
  • dW and db are the current gradients
  • the v_dw and v_db on the right-hand side are the previous (running-average) estimates
  • this dynamically averages the previous gradient estimates and adds the result to gradient descent to direct the weight updates (see the sketch below)
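A minimal sketch of one momentum update for a single parameter, following the formulas above (function name and defaults are mine):

```python
def momentum_step(W, dW, v_dw, lr=0.01, beta=0.9):
    v_dw = beta * v_dw + (1 - beta) * dW   # running average of the gradients
    W = W - lr * v_dw                      # step along the smoothed direction
    return W, v_dw
```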
62
Q

What are the hyperparameters in gradient descent with momentum?

A
  1. Learning rate α
  2. the damping factor β (commonly set to 0.9), which determines how much of the past momentum is retained.
63
Q

What is a running average in the context of optimization?

A
  • A running average is an example of exponential smoothing, where recent values are given slightly more importance, helping to smooth out noisy updates.
64
Q

How is a running average conceptually similar to the momentum update in optimization algorithms?

A

Both combine past gradients with the current one to smooth out noisy updates, resulting in more stable and consistent progress toward the minimum.

65
Q

How does momentum affect gradient descent?

A

Momentum keeps a running average of the gradient estimates and adds it to the gradient descent process, leading to a more stable pattern with reduced oscillations and overshooting.

66
Q

How does momentum help optimization in terms of local minima?

A

Momentum speeds up the optimization in directions without oscillations, enabling it to escape shallow local minima and reach the global minimum more efficiently.

67
Q

What issue does RMSProp address that momentum does not?

A
  • RMSProp addresses the variance of the gradients, which momentum does not consider.
  • It does this by keeping a moving average of the squared gradients and scaling the learning rate dynamically for each parameter.
68
Q

How does RMSProp stabilize training?

A

By adjusting the step size for each parameter dynamically based on the magnitude of past gradients, ensuring steady progress in high curvature or noisy regions of the loss landscape.

69
Q

What is the role of s_dw and s_db in RMSProp?

A

They represent the moving averages of the squared gradients for weights W and biases b, used to normalize the update step.

70
Q

How are the weight and bias updates computed in RMSProp?

A
  • W = W − α (dW / sqrt(s_dw))
  • b = b − α (db / sqrt(s_db))
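A minimal sketch of one RMSProp update following these formulas (the small constant eps is my addition for numerical stability, analogous to the ϵ in the Adam cards below):

```python
import numpy as np

def rmsprop_step(W, dW, s_dw, lr=0.001, beta=0.9, eps=1e-8):
    s_dw = beta * s_dw + (1 - beta) * dW ** 2   # moving average of squared grads
    W = W - lr * dW / (np.sqrt(s_dw) + eps)     # per-parameter scaled step
    return W, s_dw
```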
71
Q

What is the recursive formula for the squared gradient moving average in RMSProp?

A

The recursive update is s_t = β s_{t-1} + (1−β) g_t^2, which unrolls to s_T = (1−β) sum_i(β^{T−i} g_i^2).

72
Q

How does RMSProp adjust the step size based on uncertainty?

A
  • For high uncertainty (large or fluctuating gradients), the step size is small, as α is diminished by the large sqrt(s) denominator.
  • For high certainty (consistent gradients), the step size is larger, as α remains largely unaffected.
73
Q

What is Adam optimization a combination of?

A
  1. momentum (which smooths gradient updates using first-order moments)
  2. RMSProp (which adapts learning rates using second-order moments).
74
Q

What is the key problem with RMSProp during early training, and how does Adam address it?

A
  • The key problem with RMSProp during early training is that the moving averages of past gradients start close to zero and only gradually stabilize as more data points are processed.
  • This contraction toward zero makes the updates too conservative, resulting in very small step sizes that can slow down learning significantly.
  • Adam addresses this by applying bias correction, which adjusts the moving averages of the gradients and squared gradients to account for their initial underestimation.
  • This ensures that the step sizes reflect the true gradient magnitude early in training, allowing for more effective exploration of the loss surface.
75
Q

How does Adam optimization combine the strengths of momentum and RMSProp?

A

momentum

  • Computes moving averages of the gradients (v_dw and v_db) using an exponential decay factor (β_1, typically set to 0.9).
  • This smooths the updates and helps dampen oscillations, ensuring more stable convergence.

RMSprop

  • Computes the moving averages of the squared gradients (s_dw and s_db) using an exponential decay factor (β_2, typically set to 0.999)
  • This helps adapt the learning rate dynamically by normalizing updates based on the magnitude of recent gradients, preventing overly large updates in steep regions or small updates in flat regions.
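Putting the two together, a minimal sketch of one Adam update with the bias correction described in the previous card (function name and defaults are illustrative):

```python
import numpy as np

def adam_step(W, dW, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    v = beta1 * v + (1 - beta1) * dW          # momentum: first moment
    s = beta2 * s + (1 - beta2) * dW ** 2     # RMSProp: second moment
    v_hat = v / (1 - beta1 ** t)              # bias correction: v and s start
    s_hat = s / (1 - beta2 ** t)              #   near zero, so rescale them
    W = W - lr * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s
```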
76
Q

What are the hyperparameters of Adam, and what are their typical values?

A
  1. α: The learning rate, typically set to 0.001.
  2. β_1: Decay rate for the moving average of gradients, typically set to 0.9.
  3. β_2: Decay rate for the moving average of squared gradients, typically set to 0.999.
  4. ϵ: A small constant for numerical stability, typically set to 10^-8.
77
Q

What are the key advantages of Adam optimization?

A
  1. Adaptive learning rates: By combining momentum and RMSProp, Adam adapts the learning rate for each parameter dynamically, leading to more efficient optimization.
  2. Bias correction: Ensures accurate step sizes even during early training when moving averages are still stabilizing.
  3. Fast convergence: Due to its bias correction and adaptive learning rates, Adam often converges faster than other optimizers.
  4. Robustness: Works well for a wide range of deep learning problems and requires little hyperparameter tuning compared to other algorithms.
78
Q

What is the main behavior of Gradient Descent?

A
  • Takes small, consistent steps downhill.
  • Follows a straightforward but slow trajectory; it struggles in narrow valleys, requiring multiple steps to align itself with the global minimum.
79
Q

What is the main challenge of Gradient Descent?

A

It can be slow in regions with shallow gradients and may oscillate in narrow valleys.

80
Q

How does Momentum improve upon Gradient Descent?

A
  • It accelerates progress in consistent gradient directions by adding a “velocity” term, which dampens oscillations.
  • It takes smoother strides and converges more quickly into the global minimum by effectively handling the steep slopes.
81
Q

What is the advantage of using Momentum?

A

It helps navigate narrow valleys and speeds up convergence compared to plain Gradient Descent.

82
Q

How does RMSprop handle steep slopes and flat regions effectively?

A

By using an exponentially weighted moving average of squared gradients to adaptively scale the learning rate.

83
Q

What are the advantages of RMSprop?

A

It converges efficiently in both steep and flat regions, avoiding overly aggressive learning rate decay.

84
Q

How does Adam optimization combine the features of other methods?

A

It combines Momentum and RMSprop, using both a running average of gradients and their squared values with bias correction.

85
Q

What happens to all optimizers at saddle points?

A
  • Adam avoids getting stuck at saddle points
  • All other optimizers get stuck.
86
Q

Which optimizers can reach the global minimum with a slight increase in convexity?

A

Momentum and Adam.

87
Q

What is the primary goal of hyperparameter tuning?

A

The goal is to optimize hyperparameters to minimize a loss function f(x) by finding the best set of values that improve model performance.

88
Q

Why should you avoid using grid search for hyperparameter tuning?

A

Because the number of evaluations grows exponentially with the number of hyperparameters, making grid search computationally expensive.

89
Q

What is a more efficient alternative to grid search for hyperparameter tuning?

A

Random search or Bayesian optimization is recommended, as both explore the parameter space more efficiently.

90
Q

What is an effective strategy for hyperparameter tuning?

A
  1. Go from coarse to fine.
  2. Pick an appropriate scale for each hyperparameter.
  3. Run the optimization in parallel.
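For point 2, "an appropriate scale" typically means sampling the learning rate log-uniformly rather than uniformly; a small illustration (the ranges are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.uniform(-4, -1, size=5)                 # exponents sampled uniformly
learning_rates = 10.0 ** r                      # log-uniform between 1e-4 and 1e-1
batch_sizes = 2 ** rng.integers(4, 9, size=5)   # coarse grid: 16 .. 256
```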
91
Q

How does Bayesian optimization differ from grid search in hyperparameter tuning?

A

Bayesian optimization uses a probabilistic model (surrogate) to predict the performance of different hyperparameter values and focuses on areas with high expected improvement.

92
Q

What does Bayesian optimization use to model the unknown function?

A

It uses a surrogate model, often a Gaussian Process (GP), which provides predictions and uncertainty estimates for the function.

93
Q

What are the key components of a Gaussian Process used in Bayesian optimization?

A

The key components are the mean function (prediction) and the covariance function (measuring similarity between points).

94
Q

How does Bayesian optimization decide where to evaluate the next point in the hyperparameter space?

A

It uses a metric called Expected Improvement (EI) to choose the next point by balancing exploration and exploitation.

95
Q

What does the Expected Improvement (EI) metric measure in Bayesian optimization?

A

EI measures the expected gain from evaluating a new point, considering both the potential for improvement and uncertainty.
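For reference, the standard closed form of EI under a Gaussian surrogate, for minimization (textbook notation; the lecture's exact notation may differ):

```latex
% f_min: best value observed so far; mu(x), sigma(x): GP mean and std at x
\mathrm{EI}(x) = \bigl(f_{\min} - \mu(x)\bigr)\,\Phi(Z) + \sigma(x)\,\varphi(Z),
\qquad Z = \frac{f_{\min} - \mu(x)}{\sigma(x)}
```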

96
Q

How does the surrogate model evolve during the Bayesian optimization process?

A

As more evaluations are performed, the surrogate model becomes more accurate by interpolating the observed data points better.

97
Q

Why are uncertainty bounds important in Bayesian optimization?

A

They help determine areas with high uncertainty where further sampling could yield significant improvements.

98
Q

What is the key insight behind using few curves vs. many curves in Bayesian optimization?

A

Few curves indicate high uncertainty, while many curves show increasing confidence in specific regions as more evaluations are performed.

99
Q

What does a sharp peak in the histogram during Bayesian optimization indicate?

A

It indicates that the model is highly confident about the location of the best hyperparameter.

100
Q

What happens to uncertainty as you sample more points around a specific region?

A

Uncertainty decreases as the model gains more information about that region.

101
Q

In Bayesian optimization, why does sampling far from known points often lead to higher expected gain?

A

Sampling far from known points that are not the minimum explores new areas with high uncertainty, potentially leading to large improvements.

102
Q

What is the purpose of using a surrogate model in Bayesian optimization?

A

The surrogate model approximates the true loss function to guide the search for the best hyperparameters more efficiently.

103
Q

Why is parallel evaluation important in hyperparameter tuning?

A

Parallel evaluation speeds up the tuning process by testing multiple hyperparameter sets simultaneously.

104
Q

Interpretation of the EI function

A
  • EI tells us how to use the surrogate's predictive distribution to choose the next point to sample.
  • It takes the best (lowest) value seen so far, compares it to the candidate point, and weights the difference by the estimated density.
  • If there is a large potential improvement but the density shows a low likelihood of the minimum being at that location, EI outputs a low value.
  • If there is a large potential improvement and the distribution shows a high likelihood, EI outputs a high value.