Gradient Descent Flashcards

1
Q

What is Gradient Descent?

A

Gradient Descent is an iterative optimization algorithm used to minimize a loss/cost function by adjusting model parameters in the direction of the negative gradient. It is foundational in machine learning for training models like linear regression and neural networks.

2
Q

What is the general update rule for Gradient Descent?

A

For parameters θ and learning rate α:
θ = θ - α ⋅ ∇J(θ),
where ∇J(θ) is the gradient of the cost function J(θ).
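A minimal NumPy sketch of this rule (the helper name, step count, and the quadratic example are illustrative, not part of the card):

import numpy as np

def gradient_descent(grad_J, theta0, alpha=0.1, n_iters=100):
    """Generic gradient descent: theta <- theta - alpha * grad_J(theta)."""
    theta = theta0.copy()
    for _ in range(n_iters):
        theta = theta - alpha * grad_J(theta)
    return theta

# Example: J(θ) = ||θ||² has gradient 2θ, so the iterates shrink toward 0.
theta_min = gradient_descent(lambda t: 2 * t, np.array([3.0, -4.0]))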

3
Q

What is Batch Gradient Descent (BGD)?

A

BGD computes the gradient using the entire training dataset and updates parameters once per epoch. It is stable but computationally expensive.

4
Q

What is the cost function for BGD in linear regression (MSE)?

A

J(θ) = (1/2m) Σ(hθ(xⁱ) - yⁱ)²,
where m = number of training examples, hθ(x) = θᵀx.

5
Q

What is the gradient of MSE for BGD?

A

∇J(θ) = (1/m) Σ(hθ(xⁱ) - yⁱ) ⋅ xⁱ.
For parameter θ_j: ∂J/∂θ_j = (1/m) Σ(hθ(xⁱ) - yⁱ) ⋅ x_jⁱ.
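Cards 4 and 5 translate directly into vectorized NumPy; a sketch assuming X already includes a bias column of ones:

import numpy as np

def bgd_linear_regression(X, y, alpha=0.01, n_epochs=1000):
    """Batch GD for linear regression: one full-dataset update per epoch."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        errors = X @ theta - y               # hθ(xⁱ) - yⁱ for every example
        theta -= alpha * (X.T @ errors) / m  # (1/m) Σ (error) ⋅ xⁱ
    return theta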

6
Q

What are the pros and cons of BGD?

A

Pros: Stable convergence, accurate gradients. Cons: Slow for large datasets, high memory usage.

7
Q

What is Stochastic Gradient Descent (SGD)?

A

SGD updates parameters per training example (one example at a time). It is faster but noisier than BGD.

8
Q

What is the update rule for SGD?

A

For a single example (xⁱ, yⁱ):
θ = θ - α ⋅ (hθ(xⁱ) - yⁱ) ⋅ xⁱ.
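A per-example sketch of this rule (the epoch-wise shuffling is standard practice, not stated on the card):

import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, n_epochs=10, seed=0):
    """SGD: one parameter update per (shuffled) training example."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):     # visit examples in random order
            error = X[i] @ theta - y[i]  # hθ(xⁱ) - yⁱ
            theta -= alpha * error * X[i]
    return theta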

9
Q

Why does SGD have noisy updates?

A

Because each update uses a single example’s gradient: an unbiased but high-variance estimate of the true gradient of the entire dataset.

10
Q

What is Mini-Batch Gradient Descent?

A

A compromise between BGD and SGD. It uses small batches (e.g., 32, 64 examples) to compute gradients. Balances speed and stability.

11
Q

What is the update rule for Mini-Batch GD?

A

For a batch size b:
θ = θ - α ⋅ (1/b) Σ(hθ(xⁱ) - yⁱ) ⋅ xⁱ
for i=1 to b.
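The same rule as a sketch, averaging the gradient over each batch (batch slicing details are illustrative):

import numpy as np

def minibatch_gd(X, y, alpha=0.01, b=32, n_epochs=10, seed=0):
    """Mini-batch GD: average the gradient over batches of size b."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        idx = rng.permutation(m)
        for start in range(0, m, b):
            batch = idx[start:start + b]          # last batch may be smaller
            errors = X[batch] @ theta - y[batch]
            theta -= alpha * (X[batch].T @ errors) / len(batch)
    return theta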

12
Q

Why is Mini-Batch GD widely used in deep learning?

A

It leverages vectorized operations for efficiency, reduces (though does not eliminate) SGD’s gradient noise, and scales well to large datasets.

13
Q

What is the role of the learning rate (α)?

A

α controls the step size of parameter updates. Too small: slow convergence. Too large: oscillations/divergence.

14
Q

How does learning rate decay improve convergence?

A

Reducing α over time (e.g., α = α₀ / (1 + decay_rate * epoch)) helps fine-tune parameters near the minimum.
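The schedule from the card as a one-line helper:

def decayed_alpha(alpha0, decay_rate, epoch):
    """1/t-style decay: alpha = alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1 + decay_rate * epoch)

# With alpha0=0.1, decay_rate=1: 0.1, 0.05, 0.033, 0.025, ...
alphas = [decayed_alpha(0.1, 1.0, e) for e in range(4)]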

15
Q

What is the difference between BGD, SGD, and Mini-Batch GD in terms of updates?

A
  • BGD: 1 update/epoch.
  • SGD: m updates/epoch (m = dataset size).
  • Mini-Batch: m/b updates/epoch (b = batch size).
16
Q

What is an ‘epoch’?

A

One full pass through the entire training dataset.

17
Q

What is a ‘local minimum’? How does SGD escape it?

A

A point where J(θ) is lower than nearby points but not the global minimum. SGD’s noisy updates can ‘jump’ out of local minima.

18
Q

What is a saddle point? Why is it problematic?

A

A point where the gradient is zero but that is neither a minimum nor a maximum: the loss curves upward in some directions and downward in others (common in high-dimensional spaces). SGD’s noise helps escape it.

19
Q

Why is feature scaling important for Gradient Descent?

A

Scaling features (e.g., normalization) gives the loss surface similar curvature in all directions, so one learning rate suits every parameter and convergence is faster.

20
Q

What is the convergence criterion for Gradient Descent?

A

Stop when ||∇J(θ)|| < ε (small threshold) or when J(θ) changes minimally between epochs.
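A sketch of the gradient-norm stopping test (threshold and iteration cap are illustrative):

import numpy as np

def gd_until_converged(grad_J, theta0, alpha=0.1, eps=1e-6, max_iters=10000):
    """Run GD until ||∇J(θ)|| < ε or the iteration budget runs out."""
    theta = theta0.copy()
    for _ in range(max_iters):
        g = grad_J(theta)
        if np.linalg.norm(g) < eps:  # convergence criterion from the card
            break
        theta -= alpha * g
    return theta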

21
Q

How does BGD guarantee convergence for convex functions?

A

For convex J(θ), every local minimum is the global minimum, so with a sufficiently small α the exact full-dataset gradient steps decrease J(θ) monotonically until they reach it.

22
Q

What is the time complexity of BGD vs. SGD?

A
  • BGD: O(m) per update, one update per epoch.
  • SGD: O(1) per update, m updates per epoch (O(m) total per epoch).
23
Q

What is the ‘vanishing gradient’ problem?

A

In deep networks, gradients become extremely small, slowing down updates in early layers. Not directly related to GD variants.

24
Q

What is momentum in Gradient Descent?

A

Momentum (e.g., v = γv + α∇J(θ); θ = θ - v) accelerates updates in consistent gradient directions, reducing oscillations.
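The same recurrence as a sketch (hyperparameter defaults are illustrative):

import numpy as np

def momentum_gd(grad_J, theta0, alpha=0.01, gamma=0.9, n_iters=100):
    """Heavy-ball momentum: v is a decaying sum of past gradients,
    so steps accelerate along consistently downhill directions."""
    theta = theta0.copy()
    v = np.zeros_like(theta)
    for _ in range(n_iters):
        v = gamma * v + alpha * grad_J(theta)
        theta = theta - v
    return theta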

25
Q

What is AdaGrad?

A

An adaptive learning rate method: α is scaled per parameter using past squared gradients. Works well for sparse data.
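A sketch of the AdaGrad recurrence (defaults are illustrative):

import numpy as np

def adagrad(grad_J, theta0, alpha=0.1, eps=1e-8, n_iters=100):
    """AdaGrad: each parameter's step shrinks with its accumulated
    squared gradients, so rarely-updated (sparse) features keep larger steps."""
    theta = theta0.copy()
    G = np.zeros_like(theta)  # running sum of squared gradients
    for _ in range(n_iters):
        g = grad_J(theta)
        G += g ** 2
        theta -= alpha * g / (np.sqrt(G) + eps)
    return theta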

26
Q

What is RMSProp?

A

Improves AdaGrad by using an exponentially weighted average of squared gradients to handle non-convex problems.
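A sketch showing the single change from AdaGrad, the decaying average:

import numpy as np

def rmsprop(grad_J, theta0, alpha=0.001, beta=0.9, eps=1e-8, n_iters=100):
    """RMSProp: the squared-gradient accumulator decays instead of
    growing forever, so the effective learning rate never hits zero."""
    theta = theta0.copy()
    s = np.zeros_like(theta)
    for _ in range(n_iters):
        g = grad_J(theta)
        s = beta * s + (1 - beta) * g ** 2  # exponentially weighted average
        theta -= alpha * g / (np.sqrt(s) + eps)
    return theta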

27
Q

What is Adam?

A

Combines momentum and RMSProp:
m_t = β₁m_{t-1} + (1-β₁)g_t
v_t = β₂v_{t-1} + (1-β₂)g_t²
θ = θ - α ⋅ m̂_t / (√v̂_t + ε),
where m̂_t = m_t / (1 - β₁ᵗ) and v̂_t = v_t / (1 - β₂ᵗ) are the bias-corrected estimates.
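The full recurrence as a compact sketch (the defaults shown are the values commonly used with Adam):

import numpy as np

def adam(grad_J, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, n_iters=100):
    """Adam: momentum (m) + RMSProp-style scaling (v), with bias
    correction for the zero-initialized moments (see card 40)."""
    theta = theta0.copy()
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, n_iters + 1):
        g = grad_J(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta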

28
Q

Why is Mini-Batch GD preferred over SGD in practice?

A

Mini-Batch GD exploits hardware parallelism (e.g., GPUs) for vectorized batch computations, and averaging over the batch gives lower-variance gradients than SGD.

29
Q

How do you choose the batch size in Mini-Batch GD?

A

Common choices: 32, 64, 128. Small batches add noise (regularization), large batches mimic BGD’s stability.

30
Q

What is online learning? How does it relate to SGD?

A

Online learning updates models incrementally as data arrives (e.g., streaming). SGD is a form of online learning.

31
Q

What is early stopping?

A

Halting training when validation error starts increasing to prevent overfitting. Used with all GD variants.

32
Q

How does regularization (e.g., L2) affect GD updates?

A

Adds a penalty term (e.g., λ||θ||²) to the loss. Gradients become ∇J(θ) + 2λθ.
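One regularized step as a sketch (the ‘weight decay’ reading is the standard equivalence for plain GD):

def l2_gd_step(grad_J, theta, alpha, lam):
    """One step on J(θ) + λ||θ||²: the penalty adds 2λθ to the gradient,
    which is equivalent to shrinking θ ('weight decay') each update."""
    return theta - alpha * (grad_J(theta) + 2 * lam * theta)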

33
Q

What is the difference between GD and Newton’s method?

A

Newton’s method uses the Hessian (second derivatives) for faster convergence but is computationally expensive.

34
Q

Why is SGD used for non-convex loss functions (e.g., neural networks)?

A

Its noisy updates help escape local minima, making it suitable for complex, non-convex optimization.

35
Q

What is a learning rate schedule? Example?

A

A plan to adjust α during training. Example: α = α₀ * e^(-kt), where t is the iteration number.

36
Q

How does parallelization improve Mini-Batch GD?

A

Batch computations can be split across multiple cores/GPUs, speeding up training.

37
Q

What is gradient clipping?

A

Limiting gradient values to a maximum threshold to prevent exploding gradients in deep networks.
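A sketch of norm-based clipping (the threshold is illustrative):

import numpy as np

def clip_by_norm(g, max_norm=1.0):
    """Rescale the gradient so its norm never exceeds max_norm."""
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g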

38
Q

What is the Polyak step size (Polyak learning rate)?

A

α = (J(θ) - J*) / ||∇J(θ)||², where J* is the optimal loss value. Rarely usable in practice (J* is usually unknown) but theoretically optimal.

39
Q

What is the disadvantage of using a fixed learning rate?

A

It may cause slow convergence (if too small) or oscillations (if too large). Adaptive methods (Adam) mitigate this.

40
Q

What is the role of bias correction in Adam?

A

Corrects the zero-initialized moment estimates, which are biased toward zero early in training: m̂_t = m_t / (1 - β₁ᵗ) and v̂_t = v_t / (1 - β₂ᵗ).

41
Q

How does SGD handle redundant training examples?

A

SGD copes well with redundancy: near-duplicate examples produce similar gradients, so it makes progress without processing every copy (unlike BGD, which recomputes them all). Shuffling each epoch prevents ordering effects from biasing updates.

42
Q

What is the effect of batch size on generalization?

A

Small batches (e.g., SGD) act as a regularizer, often improving generalization by adding noise.

43
Q

What is the difference between GD and genetic algorithms?

A

GD is gradient-based local search (deterministic in its batch form). Genetic algorithms are gradient-free, population-based stochastic searches inspired by evolution.

44
Q

How does GD handle non-differentiable loss functions?

A

Subgradient methods or proximal GD are used (e.g., for L1 regularization).

45
Q

What is coordinate descent? How is it different from GD?

A

Coordinate descent updates one parameter (coordinate) at a time, holding the others fixed, rather than moving all parameters along the gradient. Useful for high-dimensional problems (e.g., Lasso).

46
Q

What is the Hessian matrix? Why is it not used in GD?

A

The matrix of second-order partial derivatives of J(θ). GD avoids it because computing and inverting an n×n Hessian costs O(n²)–O(n³) per step, prohibitive for large n; Newton’s method uses it.

47
Q

What is the stochastic variance reduced gradient (SVRG)?

A

A variant of SGD that reduces gradient variance by using a periodically recomputed full-dataset gradient as a control variate for each stochastic gradient.

48
Q

What is the disadvantage of AdaGrad?

A

Its accumulated squared gradients grow monotonically, so the effective learning rate can shrink toward zero over time, effectively halting updates.

49
Q

What is the difference between GD and backpropagation?

A

GD is the optimization algorithm. Backpropagation computes gradients efficiently in neural networks.

50
Q

What is Nesterov Accelerated Gradient (NAG)?

A

A momentum variant that ‘looks ahead’ to the future gradient for smoother updates.
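A sketch of the look-ahead step (hyperparameters are illustrative):

import numpy as np

def nesterov_gd(grad_J, theta0, alpha=0.01, gamma=0.9, n_iters=100):
    """NAG: evaluate the gradient at the look-ahead point θ - γv,
    i.e., where momentum is about to carry the parameters."""
    theta = theta0.copy()
    v = np.zeros_like(theta)
    for _ in range(n_iters):
        g = grad_J(theta - gamma * v)  # gradient at the look-ahead point
        v = gamma * v + alpha * g
        theta = theta - v
    return theta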

51
Q

What is the Fletcher-Reeves method?

A

A conjugate gradient method that avoids computing the Hessian, using past gradients for direction.

52
Q

What is the Wolfe condition?

A

A pair of line-search criteria for choosing a step size: sufficient decrease in the loss (the Armijo condition) plus a curvature condition that rules out overly short steps.

53
Q

What is the Armijo rule?

A

A condition to guarantee a sufficient decrease in the loss during line search.

54
Q

How does GD perform in high-dimensional spaces?

A

It can struggle with saddle points and flat regions. Adaptive methods (Adam) help navigate such landscapes.

55
Q

What is the relationship between GD and the Normal Equation?

A

The Normal Equation (θ = (XᵀX)⁻¹Xᵀy) solves linear regression in one step, while GD iteratively approaches the solution.
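The closed form as a sketch (using a linear solve rather than an explicit inverse, a standard stability choice):

import numpy as np

def normal_equation(X, y):
    """Solve XᵀX θ = Xᵀy directly; GD would approach this θ iteratively."""
    return np.linalg.solve(X.T @ X, X.T @ y)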

56
Q

How does GD handle multicollinearity in data?

A

GD will still converge, but the learning rate may need tuning. Regularization (e.g., L2) is often added.

57
Q

What is the disadvantage of using BGD for online learning?

A

BGD requires the entire dataset, making it incompatible with streaming/online data.

58
Q

What is the regret in online learning? How does SGD minimize it?

A

Regret is the cumulative difference between the learner’s loss and the loss of the best fixed parameters in hindsight. For convex losses, SGD achieves sublinear regret (e.g., O(√T)).

59
Q

What is the difference between GD and Expectation-Maximization (EM)?

A

EM is used for latent variable models (e.g., GMMs), while GD is a general-purpose optimizer.

60
Q

What is the ELBO in variational inference? How is GD used?

A

The Evidence Lower BOund is maximized using GD to train variational autoencoders (VAEs).

61
Q

What is the role of GD in reinforcement learning?

A

GD optimizes policy or value function parameters (e.g., in policy gradient methods).

62
Q

How does GD relate to the Karush-Kuhn-Tucker (KKT) conditions?

A

KKT conditions generalize gradients to constrained optimization. GD is used in unconstrained settings.

63
Q

What is the difference between GD and Frank-Wolfe?

A

Frank-Wolfe (conditional gradient) solves constrained problems without projections: each iteration minimizes a linear approximation of the loss over the feasible set and steps toward that minimizer.

64
Q

What is mirror descent?

A

A generalization of GD using Bregman divergences for non-Euclidean geometry.

65
Q

What is the role of GD in meta-learning?

A

GD is used to optimize hyperparameters or learn initialization parameters for fast adaptation.