Lecture 6 - DNNs Flashcards

1
Q

How do shallow networks differ from deep networks?

A

Shallow networks have fewer hidden layers (1 or 2), while deep networks have many hidden layers (e.g., 5 or more).

2
Q

Why are deep neural networks considered a separate research area?

A

Deep networks come with additional complexities in learning. Traditional gradient descent doesn’t work well for them, requiring specialized techniques.

3
Q

Why are deep networks necessary for tasks like image recognition and language processing?

A

The additional layers and nodes in deep networks enable the creation of more complex features, which are essential for these tasks.

4
Q

Why use deep neural networks instead of shallow networks that are already universal approximators?

A

Deep neural networks can compute many functions with far fewer hidden units, whereas shallow networks require exponentially many hidden units to achieve the same level of approximation.

5
Q

How do deep and shallow networks compare in terms of required resources for complex problems?

A
  1. Deep networks require only a logarithmic number of neurons, O(log n), for certain problems.
  2. Shallow networks require exponentially many neurons, O(2^n), for the same problems.
  • Deep networks exhibit logarithmic growth, whereas shallow networks exhibit exponential growth in the number of required neurons.
6
Q

What is an example of a problem where deep networks outperform shallow networks in terms of efficiency?

A

For problems like x_1 XOR x_2 XOR x_3 XOR x_4 (the parity function), deep networks can solve the problem with far fewer neurons than shallow networks (see the sketch below).
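As an illustration of the depth advantage (my sketch, not from the lecture): n-bit parity can be computed by a balanced tree of pairwise XORs with O(log n) layers, whereas a single hidden layer essentially has to enumerate exponentially many input patterns.

```python
def parity(bits):
    """n-bit parity via a balanced XOR tree; assumes len(bits) is a power of two.
    Each pass halves the number of values, so the 'depth' is O(log n)."""
    layer = list(bits)
    while len(layer) > 1:
        layer = [layer[i] ^ layer[i + 1] for i in range(0, len(layer), 2)]
    return layer[0]

print(parity([1, 0, 1, 1]))  # x1 XOR x2 XOR x3 XOR x4 -> 1
```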

7
Q

What is forward propagation in DNNs?

A

Forward propagation is the process of passing input data through the layers of a neural network to compute the final output.

8
Q

What are the two main operations performed by each layer in a neural network during forward propagation?

A
  1. Linear transformation: inputs from the previous layer are multiplied by a weight matrix and added to a bias vector, resulting in the pre-activation values.
  2. Non-linear activation: the pre-activation values are passed through an activation function to produce the layer's output activations (see the sketch below).
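A minimal NumPy sketch of these two operations for one layer (variable names are illustrative, not from the lecture slides):

```python
import numpy as np

def layer_forward(a_prev, W, b, g=np.tanh):
    """One layer of forward propagation.

    a_prev: activations from the previous layer, shape (n_prev, m)
    W:      weight matrix, shape (n, n_prev)
    b:      bias vector, shape (n, 1)
    g:      non-linear activation function
    """
    z = W @ a_prev + b   # 1. linear transformation (pre-activation values)
    a = g(z)             # 2. non-linear activation
    return z, a          # z is cached for backward propagation
```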
9
Q

Why is a non-linear activation function important in forward propagation?

A

The non-linear activation function introduces non-linearity, enabling the neural network to learn and model complex patterns.

10
Q

What is backward propagation in DNNs?

A

Backward propagation is the process used to calculate the gradient of the loss function with respect to the weights and biases in the network. The gradients are used to update the parameters via optimization (e.g., gradient descent).

11
Q

What is the starting point for backward propagation?

A

The starting point is the derivative of the loss with respect to the output of the final layer L, i.e., da^[L].

12
Q

What are the key steps in computing the gradient flow for each layer during backward propagation?

A
  1. compute the gradient of the pre-activation values (dz^[l])
  2. compute the gradient of the weights (dW^[l])
  3. compute the gradient of the biases (db^[l])
  4. backpropagate the error to the previous layer (da^[l-1])
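In code, these four steps for a single layer might look like the following sketch (g_prime is the derivative of the layer's activation function; names are illustrative):

```python
import numpy as np

def layer_backward(da, z, a_prev, W, g_prime):
    """Gradient flow through one layer; m is the batch size."""
    m = a_prev.shape[1]
    dz = da * g_prime(z)                         # 1. gradient of pre-activations
    dW = (dz @ a_prev.T) / m                     # 2. gradient of the weights
    db = np.sum(dz, axis=1, keepdims=True) / m   # 3. gradient of the biases
    da_prev = W.T @ dz                           # 4. error for the previous layer
    return dW, db, da_prev
```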
13
Q

What are hyperparameters in a neural network?

A

Hyperparameters are parameters set before training a neural network, influencing the learning process and performance of the model.

14
Q

How does the number of layers affect a neural network?

A
  • The number of layers refers to the depth of the network.
  • More layers allow the network to learn complex hierarchical features but increase computational complexity and the risk of overfitting.
15
Q

What does the number of units per layer determine in a neural network?

A

The number of units per layer determines the width of the network. More units increase the model’s capacity but may also lead to overfitting.

16
Q

Why is learning rate an important hyperparameter in gradient descent?

A
  • The learning rate controls the step size during gradient descent.
  • A high learning rate might overshoot the minimum, while a low learning rate can make training slow.
17
Q

What role do activation functions play in neural networks?

A
  • Activation functions define the transformation applied to the input at each layer.
  • Common choices include ReLU, Sigmoid, and Tanh.
18
Q

Important hyperparameters

A
  1. layers
  2. units
  3. learning rate
  4. activation functions
  5. batch size
  6. weight initialization strategies
  7. dropout rate
  8. optimizers
19
Q

How should datasets be divided to evaluate and optimize hyperparameters?

A
  1. Training set: Used to train the model.
  2. Development set: Used to tune the hyperparameters.
  3. Test set: Used to evaluate the performance of the algorithm.
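A quick sketch of such a three-way split (the 60/20/20 ratio here is my choice, not from the lecture):

```python
import numpy as np

def train_dev_test_split(X, y, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                 # shuffle example indices
    n_train = int(0.6 * len(X))                   # 60% train
    n_dev = int(0.2 * len(X))                     # 20% dev, 20% test
    tr, dev, te = np.split(idx, [n_train, n_train + n_dev])
    return (X[tr], y[tr]), (X[dev], y[dev]), (X[te], y[te])
```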
20
Q

What are bias and variance in the context of neural networks?

A
  1. Bias: Error due to overly simplistic assumptions in the model. High bias indicates underfitting.
  2. Variance: Error due to high sensitivity to small fluctuations in the training data. High variance indicates overfitting.
21
Q

Why should you address bias before variance in a model?

A
  • High bias means the model poorly approximates the data regardless of variance, so fix the bias first.
  • Only then decrease the variance.
  • Since decreasing variance can increase bias, it is crucial to check the bias again afterwards.
22
Q

What are the characteristics of a model with high variance?

A
  • A high-variance model performs well on the training set but poorly on the validation/dev set.
  • Example: Train set error = 1%, Dev set error = 11%.
23
Q

How can you reduce high variance in a neural network?

A
  1. Increase the size of the training data.
  2. Apply regularization.
  3. Adjust the network architecture (e.g., use fewer layers or units).
24
Q

What are the characteristics of a model with high bias?

A
  • A high-bias model performs poorly on both the training and validation/dev sets.
  • Example: Train set error = 15%, Dev set error = 16%.
25
Q

How can you reduce high bias in a neural network?

A
  1. Use a larger network.
  2. Train for a longer duration.
  3. Adjust the architecture (e.g., add more layers or units).
26
Q

What are the characteristics of a model with both high bias and high variance?

A
  • The model has a large error on the training set and an even larger error on the dev set, indicating that it neither fits the training data well nor generalizes.
  • Example: Train set error = 15%, Dev set error = 30%.
27
Q

How can you address high bias and high variance?

A

Address the bias first (by increasing model capacity or training longer) and then reduce variance (by applying regularization or using more data).

28
Q

What are the characteristics of a model with low bias and low variance?

A
  • This is the ideal case where both training and dev errors are low, with the dev set error slightly higher than the training set error.
  • Example: Train set error = 0.5%, Dev set error = 1%.
29
Q

high variance plot

A

circles exactly around all the targets

30
Q

high bias plot

A

rigid straight/slanted line that captures most of the points

31
Q

high bias high variance plot

A

a straight line that makes an exception for an outlying point

32
Q

low bias low variance plot

A

circles around most of the target points but leaves the outlying point out.

33
Q

What is dropout regularization in neural networks?

A

Dropout is a regularization technique used to prevent overfitting by randomly dropping out a proportion of neurons during the training process. The dropped-out neurons do not participate in forward and backward propagation.
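A minimal sketch of the common "inverted dropout" formulation, where surviving activations are rescaled so their expected value matches test time (keep_prob and the function name are illustrative):

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8, training=True):
    if not training:
        return a                                  # no dropout at test time
    mask = np.random.rand(*a.shape) < keep_prob   # drop ~(1 - keep_prob) of units
    return a * mask / keep_prob                   # rescale so E[output] is unchanged
```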

34
Q

How does dropout regularization help in preventing overfitting?

A

By randomly turning off neurons, dropout forces the network to distribute the learned weights across different neurons, preventing any single neuron from becoming too dominant and improving the model’s generalization ability.

35
Q

Why is it impractical to use a separate regularization term for each layer in deeper neural networks?

A

As the number of layers increases, tuning a separate regularization term for every layer becomes impractical because it adds too many hyperparameters to tune. Dropout addresses this issue without adding extra per-layer hyperparameters.

36
Q

What do L2 regularization and dropout target?

A
  1. L2 Regularization: Prevents overly large weights.
  2. Dropout: Prevents reliance on specific neurons.
37
Q

What is early stopping in neural network training?

A
  • An alternative to regularization.
  • Early stopping halts the training process at the point where the validation error is smallest, preventing overfitting by not allowing the model to train further once it starts overfitting (see the sketch below).
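A minimal sketch of the idea, assuming hypothetical train_step() and dev_error() callables:

```python
def train_with_early_stopping(train_step, dev_error, max_epochs=100, patience=5):
    """Stop once the dev error has not improved for `patience` epochs."""
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()                              # one epoch of training
        err = dev_error()
        if err < best_err:
            best_err, best_epoch = err, epoch     # checkpointing would go here
        elif epoch - best_epoch >= patience:
            break                                 # dev error stopped improving
    return best_err
```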
38
Q

How does data augmentation help in reducing overfitting?

A

Data augmentation artificially increases the size and diversity of the training dataset by applying transformations (e.g., rotations, translations) to the data, making the model more robust to variations.
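A toy sketch of two such transformations on a 2-D image array (purely illustrative; real pipelines use library transforms):

```python
import numpy as np

def augment(image, rng):
    """Randomly mirror the image and shift it by up to 2 pixels."""
    if rng.random() < 0.5:
        image = image[:, ::-1]           # horizontal flip
    shift = int(rng.integers(-2, 3))     # small translation
    return np.roll(image, shift, axis=1)
```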

39
Q

What is the key benefit of data augmentation?

A

It helps models learn to ignore irrelevant variations, such as minor shifts or distortions, and focus on the important features that define the object or class.

40
Q

Why is input normalization important for optimization using gradient descent?

A

Without normalization, input features with varying scales can lead to elongated cost function contours, making gradient descent inefficient as it must zig-zag through narrow valleys to converge.

41
Q

How does normalizing the inputs affect the cost function contours?

A

Normalizing inputs (scaling to have a mean of 0 and variance of 1) results in circular contours, allowing gradient descent to converge faster by following a more direct optimization path.
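A minimal sketch; note that the dev/test data must be scaled with the training set's statistics:

```python
import numpy as np

def normalize_inputs(X_train, X_test):
    """Scale features to mean 0, variance 1 using training statistics only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8   # avoid division by zero
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```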

42
Q

What are batch normalized layers, and how do they help?

A

Batch normalization layers normalize the activations partway through forward propagation before propagating forward again. This keeps deeper layers learnable and improves training stability and speed (see the sketch below).
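A minimal forward-pass sketch of the standard batch-normalization formulation (gamma and beta are learned scale and shift parameters; the running statistics used at test time are omitted):

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=1, keepdims=True)        # per-unit mean over the batch
    var = z.var(axis=1, keepdims=True)        # per-unit variance over the batch
    z_norm = (z - mu) / np.sqrt(var + eps)    # normalize the activations
    return gamma * z_norm + beta              # learned rescale and shift
```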

43
Q

What is the key takeaway about normalizing inputs for deep learning models?

A

Always normalize input features to improve training stability and speed in deep learning models.

44
Q

What are vanishing and exploding gradients, and why are they problematic in deep learning?

A
  • In very deep networks, the output can be expressed as a product of many weight matrices.
  • If the largest eigenvalue of W is greater than 1, gradients explode.
  • If the largest eigenvalue of W is less than 1, gradients vanish.
  • Both scenarios cause instability during training.
45
Q

What happens during vanishing gradients, and what is the effect on training?

A
  • Gradients become very small (approach 0) as they backpropagate through deep layers, especially with activation functions like Sigmoid or Tanh.
  • This causes the weights in earlier layers to update very slowly, leading to poor learning.
46
Q

How can vanishing gradients be mitigated?

A

One solution is a skip connection, which passes a layer's output directly to a later layer (bypassing the layer in between), helping maintain gradient flow during backpropagation.

47
Q

What happens during exploding gradients, and what is the effect on training?

A
  • Gradients grow excessively large in deeper layers, causing instability in learning.
  • This leads to large oscillations in gradient descent or numerical overflow.
48
Q

How can exploding gradients be mitigated?

A

One solution is gradient clipping, which limits the maximum value of the gradient to prevent large jumps in gradient descent.
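A sketch of one common variant, clipping by global norm (the threshold is illustrative):

```python
import numpy as np

def clip_by_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays if their global norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads
```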

49
Q

What is mini-batch gradient descent?

A
  • Mini-batch gradient descent is an optimization algorithm that divides the training dataset into smaller subsets called batches.
  • The algorithm computes gradients based on a batch rather than the entire dataset or a single sample.
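A skeleton of the resulting training loop, with compute_grads standing in for forward and backward propagation (all names are placeholders):

```python
import numpy as np

def minibatch_gd(params, X, y, compute_grads, lr=0.01, batch_size=64, epochs=10):
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)            # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            grads = compute_grads(params, X[batch], y[batch])
            for p, g in zip(params, grads):
                p -= lr * g                       # update from this batch only
    return params
```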
50
Q

What is the advantage of using mini-batch gradient descent over batch gradient descent?

A

Mini-batch gradient descent allows faster computation by leveraging vectorization and reduces memory requirements compared to batch gradient descent.

51
Q

How does forward propagation work in mini-batch gradient descent?

A

In mini-batch gradient descent, forward propagation is performed on a set of data (a mini-batch), the error is measured, and the derivative of the error is used in backpropagation.

52
Q

How does the cost decrease differently in batch gradient descent versus mini-batch gradient descent?

A
  1. Batch Gradient Descent: The cost decreases steadily with iterations.
  2. Mini-batch Gradient Descent: The cost converges to the same place as batch gradient descent but oscillates around it.
53
Q

What is batch gradient descent?

A

Batch gradient descent uses the entire training dataset to compute the gradient for each iteration, resulting in smooth convergence but being computationally expensive for large datasets.

54
Q

How is batch gradient descent represented in visualizations?

A

It is represented in blue, showing a smooth path toward the optimum.

55
Q

What is stochastic gradient descent (SGD)?

A

SGD updates the weights after computing the gradient for a single data point, resulting in much faster updates but noisier and more erratic convergence, with a higher risk of overshooting the minimum.

56
Q

How is stochastic gradient descent represented in visualizations?

A

It is represented in purple, showing a zig-zag, erratic path.

57
Q

How does mini-batch gradient descent combine the benefits of batch and stochastic gradient descent?

A

Mini-batch gradient descent combines the faster updates of SGD with the smoother trajectory of batch gradient descent, circling around the optimal value.

58
Q

How is mini-batch gradient descent represented in visualizations?

A

It is represented in green, showing a less chaotic path than SGD.

59
Q

What is gradient descent with momentum, and why is it used?

A
  • Gradient descent with momentum adds a “memory” to the updates by incorporating the previous update direction.
  • This helps dampen oscillations in directions where updates repeatedly reverse (e.g., ellipsoids), allowing the optimization to approach the minimum more smoothly.
60
Q

How are weights and biases updated in gradient descent with momentum?

A
  • W = W − α v_dw
  • b = b − α v_db
  • where α is the learning rate.

61
Q

How are the velocity terms v_dw and v_db computed in gradient descent with momentum?

A
  • v_dw = β v_dw + (1−β) dW
  • v_db = β v_db + (1−β) db
  • where β is the damping factor
  • dW and db are the current gradients
  • the v_dw and v_db on the right-hand side are the previous (running-average) estimates
  • this dynamically averages the previous gradient estimates and adds the result to gradient descent to direct the weight updates (see the sketch below)
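A minimal sketch of one momentum update for a single parameter, following the formulas above (function name and defaults are mine):

```python
def momentum_step(W, dW, v_dw, lr=0.01, beta=0.9):
    v_dw = beta * v_dw + (1 - beta) * dW   # running average of the gradients
    W = W - lr * v_dw                      # step along the smoothed direction
    return W, v_dw
```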
62
Q

What are the hyperparameters in gradient descent with momentum?

A
  1. Learning rate α
  2. the damping factor β (commonly set to 0.9), which determines how much of the past momentum is retained.
63
Q

What is a running average in the context of optimization?

A
  • A running average is an example of exponential smoothing, where recent values are given slightly more importance, helping to smooth out noisy updates.
64
Q

How is a running average conceptually similar to the momentum update in optimization algorithms?

A

Both combine past gradients with the current one to smooth out noisy updates, resulting in more stable and consistent progress toward the minimum.

65
Q

How does momentum affect gradient descent?

A

Momentum keeps a running average of the gradient estimates and adds it to the gradient descent process, leading to a more stable pattern with reduced oscillations and overshooting.

66
Q

How does momentum help optimization in terms of local minima?

A

Momentum speeds up the optimization in directions without oscillations, enabling it to escape shallow local minima and reach the global minimum more efficiently.

67
Q

What issue does RMSProp address that momentum does not?

A
  • RMSProp addresses the variance of the gradients, which momentum does not consider.
  • It does this by keeping a moving average of the squared gradients and scaling the learning rate dynamically for each parameter.
68
Q

How does RMSProp stabilize training?

A

By adjusting the step size for each parameter dynamically based on the magnitude of past gradients, ensuring steady progress in high curvature or noisy regions of the loss landscape.

69
Q

What is the role of s_dw and s_db in RMSProp?

A

They represent the moving averages of the squared gradients for weights W and biases b, used to normalize the update step.

70
Q

How are the weight and bias updates computed in RMSProp?

A
  • W = W − α (dW / sqrt(s_dw))
  • b = b − α (db / sqrt(s_db))
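A minimal sketch of one RMSProp update following these formulas (the small constant eps is my addition for numerical stability, analogous to the ϵ in the Adam cards below):

```python
import numpy as np

def rmsprop_step(W, dW, s_dw, lr=0.001, beta=0.9, eps=1e-8):
    s_dw = beta * s_dw + (1 - beta) * dW ** 2   # moving average of squared grads
    W = W - lr * dW / (np.sqrt(s_dw) + eps)     # per-parameter scaled step
    return W, s_dw
```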
71
Q

What is the recursive formula for the squared gradient moving average in RMSProp?

A

The recursive update is s_t = β s_{t-1} + (1−β) g_t^2, which unrolls to s_T = (1−β) sum_i(β^{T−i} g_i^2).

72
Q

How does RMSProp adjust the step size based on uncertainty?

A
  • For high uncertainty (large or fluctuating gradients), the step size is small, as α is diminished by the large sqrt(s) denominator.
  • For high certainty (consistent gradients), the step size is larger, as α remains largely unaffected.
73
Q

What is Adam optimization a combination of?

A
  1. momentum (which smooths gradient updates using first-order moments)
  2. RMSProp (which adapts learning rates using second-order moments).
74
Q

What is the key problem with RMSProp during early training, and how does Adam address it?

A
  • The key problem with RMSProp during early training is that the moving averages of past gradients start close to zero and only gradually stabilize as more data points are processed.
  • This contraction toward zero makes the updates too conservative, resulting in very small step sizes that can slow down learning significantly.
  • Adam addresses this by applying bias correction, which adjusts the moving averages of the gradients and squared gradients to account for their initial underestimation.
  • This ensures that the step sizes reflect the true gradient magnitude early in training, allowing for more effective exploration of the loss surface.
75
Q

How does Adam optimization combine the strengths of momentum and RMSProp?

A

momentum

  • Computes moving averages of the gradients (v_dw and v_db) using an exponential decay factor (β_1, typically set to 0.9).
  • This smooths the updates and helps dampen oscillations, ensuring more stable convergence.

RMSprop

  • Computes the moving averages of the squared gradients (s_dw and s_db) using an exponential decay factor (β_2, typically set to 0.999)
  • This helps adapt the learning rate dynamically by normalizing updates based on the magnitude of recent gradients, preventing overly large updates in steep regions or small updates in flat regions.
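Putting the two together, a minimal sketch of one Adam update with the bias correction described in the previous card (function name and defaults are illustrative):

```python
import numpy as np

def adam_step(W, dW, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    v = beta1 * v + (1 - beta1) * dW          # momentum: first moment
    s = beta2 * s + (1 - beta2) * dW ** 2     # RMSProp: second moment
    v_hat = v / (1 - beta1 ** t)              # bias correction: v and s start
    s_hat = s / (1 - beta2 ** t)              #   near zero, so rescale them
    W = W - lr * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s
```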
76
Q

What are the hyperparameters of Adam, and what are their typical values?

A
  1. α: The learning rate, typically set to 0.001.
  2. β_1: Decay rate for the moving average of gradients, typically set to 0.9.
  3. β_2: Decay rate for the moving average of squared gradients, typically set to 0.999.
  4. ϵ: A small constant for numerical stability, typically set to 10^-8.
77
Q

What are the key advantages of Adam optimization?

A
  1. Adaptive learning rates: By combining momentum and RMSProp, Adam adapts the learning rate for each parameter dynamically, leading to more efficient optimization.
  2. Bias correction: Ensures accurate step sizes even during early training when moving averages are still stabilizing.
  3. Fast convergence: Due to its bias correction and adaptive learning rates, Adam often converges faster than other optimizers.
  4. Robustness: Works well for a wide range of deep learning problems and requires little hyperparameter tuning compared to other algorithms.
78
Q

What is the main behavior of Gradient Descent?

A
  • Takes small, consistent steps downhill.
  • Follows a straightforward but slow trajectory; it struggles in narrow valleys, requiring multiple steps to align itself with the global minimum.
79
Q

What is the main challenge of Gradient Descent?

A

It can be slow in regions with shallow gradients and may oscillate in narrow valleys.

80
Q

How does Momentum improve upon Gradient Descent?

A
  • It accelerates progress in consistent gradient directions by adding a “velocity” term, which dampens oscillations.
  • It takes smoother strides and converges more quickly into the global minimum by effectively handling the steep slopes.
81
Q

What is the advantage of using Momentum?

A

It helps navigate narrow valleys and speeds up convergence compared to plain Gradient Descent.

82
Q

How does RMSprop handle steep slopes and flat regions effectively?

A

By using an exponentially weighted moving average of squared gradients to adaptively scale the learning rate.

83
Q

What are the advantages of RMSprop?

A

It converges efficiently in both steep and flat regions, avoiding overly aggressive learning rate decay.

84
Q

How does Adam optimization combine the features of other methods?

A

It combines Momentum and RMSprop, using both a running average of gradients and their squared values with bias correction.

85
Q

What happens to all optimizers at saddle points?

A
  • Adam avoids getting stuck at saddle points
  • All other optimizers get stuck.
86
Q

Which optimizers can reach the global minimum with a slight increase in convexity?

A

Momentum and Adam.

87
Q

What is the primary goal of hyperparameter tuning?

A

The goal is to optimize hyperparameters to minimize a loss function f(x) by finding the best set of values that improve model performance.

88
Q

Why should you avoid using grid search for hyperparameter tuning?

A

Because the number of evaluations grows exponentially with the number of hyperparameters, making grid search computationally expensive.

89
Q

What is a more efficient alternative to grid search for hyperparameter tuning?

A

Random search or Bayesian optimization is recommended, as both explore the parameter space more efficiently.

90
Q

What is an effective strategy for hyperparameter tuning?

A
  1. Go from coarse to fine.
  2. Pick an appropriate scale for each hyperparameter.
  3. Run the optimization in parallel.
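For point 2, "an appropriate scale" typically means sampling the learning rate log-uniformly rather than uniformly; a small illustration (the ranges are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.uniform(-4, -1, size=5)                 # exponents sampled uniformly
learning_rates = 10.0 ** r                      # log-uniform between 1e-4 and 1e-1
batch_sizes = 2 ** rng.integers(4, 9, size=5)   # coarse grid: 16 .. 256
```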
91
Q

How does Bayesian optimization differ from grid search in hyperparameter tuning?

A

Bayesian optimization uses a probabilistic model (surrogate) to predict the performance of different hyperparameter values and focuses on areas with high expected improvement.

92
Q

What does Bayesian optimization use to model the unknown function?

A

It uses a surrogate model, often a Gaussian Process (GP), which provides predictions and uncertainty estimates for the function.

93
Q

What are the key components of a Gaussian Process used in Bayesian optimization?

A

The key components are the mean function (prediction) and the covariance function (measuring similarity between points).

94
Q

How does Bayesian optimization decide where to evaluate the next point in the hyperparameter space?

A

It uses a metric called Expected Improvement (EI) to choose the next point by balancing exploration and exploitation.

95
Q

What does the Expected Improvement (EI) metric measure in Bayesian optimization?

A

EI measures the expected gain from evaluating a new point, considering both the potential for improvement and uncertainty.
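For reference, the standard closed form of EI under a Gaussian surrogate, for minimization (textbook notation; the lecture's exact notation may differ):

```latex
% f_min: best value observed so far; mu(x), sigma(x): GP mean and std at x
\mathrm{EI}(x) = \bigl(f_{\min} - \mu(x)\bigr)\,\Phi(Z) + \sigma(x)\,\varphi(Z),
\qquad Z = \frac{f_{\min} - \mu(x)}{\sigma(x)}
```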

96
Q

How does the surrogate model evolve during the Bayesian optimization process?

A

As more evaluations are performed, the surrogate model becomes more accurate by interpolating the observed data points better.

97
Q

Why are uncertainty bounds important in Bayesian optimization?

A

They help determine areas with high uncertainty where further sampling could yield significant improvements.

98
Q

What is the key insight behind using few curves vs. many curves in Bayesian optimization?

A

Few curves indicate high uncertainty, while many curves show increasing confidence in specific regions as more evaluations are performed.

99
Q

What does a sharp peak in the histogram during Bayesian optimization indicate?

A

It indicates that the model is highly confident about the location of the best hyperparameter.

100
Q

What happens to uncertainty as you sample more points around a specific region?

A

Uncertainty decreases as the model gains more information about that region.

101
Q

In Bayesian optimization, why does sampling far from known points often lead to higher expected gain?

A

Sampling far from known points that are not the minimum explores new areas with high uncertainty, potentially leading to large improvements.

102
Q

What is the purpose of using a surrogate model in Bayesian optimization?

A

The surrogate model approximates the true loss function to guide the search for the best hyperparameters more efficiently.

103
Q

Why is parallel evaluation important in hyperparameter tuning?

A

Parallel evaluation speeds up the tuning process by testing multiple hyperparameter sets simultaneously.

104
Q

Interpretation of the EI function

A
  • EI tells us how to use the surrogate's predictive distribution to choose the next point to sample.
  • It takes the best (lowest) value seen so far, compares it to the candidate point, and weights the difference by the estimated density.
  • If there is a large potential improvement but the density shows a low likelihood of the minimum being at that location, EI outputs a low value.
  • If there is a large potential improvement and the distribution shows a high likelihood, EI outputs a high value.