Lecture 8 - Gradient Methods Flashcards
What qualities of the Hypothesis Space affect optimisation?
- Continuous: No sudden jumps; gradient methods work well.
- Discontinuous: May break gradient methods.
- Differentiable: Needed for gradient descent.
- Non-differentiable: Use direct methods instead.
- Low modality: Few minima (easier to optimize).
- High modality: Many local minima (harder).
- Pathological: Strange landscapes, e.g., spiky or flat plateaus.
Why is it important to know the effects of these spaces?
Because gradient descent assumes a nice, smooth bowl. If the space is bumpy or weird, the method may fail or converge slowly.
What do derivatives tell us?
- First derivative f′(x): tells you the slope — use it to go up or down.
- Second derivative f′′(x): tells you the curvature — use it to judge how fast slope is changing.
What questions do derivatives answer?
- Where is the slope 0? ⇒ potential extrema (stationary points)
- Is the point a min, max, or saddle?
- Which way should I move for fastest descent?
What is Gradient Descent and its rule?
Gradient descent is an iterative optimization algorithm used to find the minimum of a function.
It works by moving in the direction of the negative gradient, which is the direction of steepest decrease in the function.
x ← x − α f′(x)
What is Gradient Ascent and its rule?
Gradient ascent is the same idea, but instead of minimizing, it maximizes the function.
You move in the direction of the gradient — where the function increases most quickly.
x ← x + α f′(x)
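A minimal sketch of one step of each rule in Python (illustrative only; the quadratic example, `f_prime`, and `alpha` are assumptions, not from the slides):

```python
# Example: f(x) = x**2, so f'(x) = 2*x.
def f_prime(x):
    return 2 * x

alpha = 0.1  # step size

x = 5.0
x = x - alpha * f_prime(x)  # descent step: move downhill, x becomes 4.0

y = 5.0
y = y + alpha * f_prime(y)  # ascent step: move uphill, y becomes 6.0
```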
What is the Gradient Ascent/Descent Algorithm process?
REFER TO SLIDES FOR BREAKDOWN
What is the Stopping Criteria of Gradient Ascent/Descent?
You stop when:
* f′(x)=0: The slope is flat — potential min or max.
* ∣f′(x)∣<ϵ: Close enough to flat.
* Time or iteration limit reached.
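A minimal sketch of the full descent loop with these stopping criteria (a hedged reconstruction, not the slide breakdown; `eps`, `max_iters`, and the example function are assumptions):

```python
def gradient_descent(f_prime, x0, alpha=0.1, eps=1e-6, max_iters=10_000):
    """Repeat x <- x - alpha * f'(x) until |f'(x)| < eps or the
    iteration limit is hit."""
    x = x0
    for _ in range(max_iters):
        g = f_prime(x)
        if abs(g) < eps:      # close enough to flat: stop
            break
        x = x - alpha * g
    return x

# Minimise f(x) = (x - 3)**2, with f'(x) = 2*(x - 3).
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # ~3.0
```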
What do you need to consider about the Stopping Criteria of Gradient Ascent/Descent?
- A zero gradient (f′(x)=0) doesn’t always mean you’ve found a minimum
- Could be:
○ Maximum (gradient ascent)
○ Minimum (gradient descent)
○ Saddle point (neither: flat but unstable)
- How do we tell the difference?
○ Use the second derivative, f′′(x)
How do you determine if it's a local min or local max?
f′(x) = 0, f′′(x) > 0 (minimum)
f′(x) = 0, f′′(x) < 0 (maximum)
f′(x) = 0, f′′(x) = 0 (inconclusive)
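A tiny sketch of this test in code (the tolerance `tol` is an assumption to handle floating-point values near zero):

```python
def classify_stationary_point(f_second, x, tol=1e-8):
    """Second derivative test at a point where f'(x) = 0."""
    curvature = f_second(x)
    if curvature > tol:
        return "minimum"
    if curvature < -tol:
        return "maximum"
    return "inconclusive"  # f''(x) ~ 0: needs further analysis

# f(x) = x**2 has f''(x) = 2 everywhere, so x = 0 is a minimum.
print(classify_stationary_point(lambda x: 2.0, 0.0))  # "minimum"
```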
What is the Ideal Case of Gradient Ascent/Descent?
○ The function is smooth and differentiable
§ No jumps, kinks, or flat regions
§ Gradient exists everywhere
○ It has a single global minimum or maximum
§ No local minima or maxima to get stuck in
§ Gradient descent will always find the right answer
○ The gradient behaves predictably
§ Far from the minimum → big slope → big step
§ Close to the minimum → small slope → small step
This means you naturally slow down as you approach the minimum → smooth convergence.
Why does the Ideal Case for Gradient Ascent/Descent matter?
- No overshooting
- No weird local optima
- No need to adjust α much
- Convergence is fast and stable
What is Derivative Step Size?
- When you’re far from the minimum, the slope (derivative) is steep → the gradient is large → you take bigger steps
- When you’re close to the minimum, the slope flattens out → the gradient is small → you take smaller steps
Why is Derivative Step Size helpful?
Because in many functions (especially quadratics), the gradient acts like a natural brake:
- Far away? → move quickly
- Close to the target? → slow down automatically
- This helps avoid overshooting the minimum
Hence, the derivative acts like a natural step-size scaler when α=1.
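A quick sketch of this self-scaling on a quadratic (the function and α are assumptions):

```python
# f(x) = x**2, f'(x) = 2*x: the step alpha * f'(x) shrinks
# automatically as x approaches the minimum at 0.
alpha = 0.25
x = 8.0
for _ in range(5):
    step = alpha * 2 * x
    print(f"x = {x:6.3f}, step = {step:6.3f}")
    x -= step  # steps halve each iteration: 4, 2, 1, 0.5, 0.25
```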
What is the Rayleigh Distribution Case?
- A Rayleigh distribution is asymmetric:
○ Steep on one side
○ Flat on the other
- Gradient descent might:
○ Take small steps where the slope is flat
○ Overshoot where the slope is steep
Why is it a problem if gradient descent takes small steps or overshoots on a Rayleigh distribution?
- Poor convergence and unpredictable behaviour; choosing α becomes difficult.
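A sketch of gradient ascent on the Rayleigh pdf with σ = 1 (mode at x = 1), illustrating both behaviours; the starting points and α = 1 are assumptions chosen to exaggerate the effect:

```python
import math

def pdf_grad(x):
    """Derivative of the Rayleigh pdf f(x) = x * exp(-x**2 / 2), sigma = 1."""
    return math.exp(-x**2 / 2) * (1 - x**2)

alpha = 1.0
for x0 in (0.2, 3.0):  # steep side, then flat tail
    x = x0
    for _ in range(5):
        x += alpha * pdf_grad(x)
    print(f"start {x0}: x = {x:.3f} after 5 steps (mode is 1.0)")
# From 0.2 the first step overshoots past the mode; from 3.0 the
# flat tail gives tiny gradients, so progress is very slow.
```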
What are the trade-offs when choosing step size for the Rayleigh distribution?
Trade-offs:
- Too small:
○ Converges slowly
○ Wastes computation
- Too large:
○ Overshoots the minimum
○ Can oscillate or diverge
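A sketch of both failure modes on f(x) = x² (the α values are assumptions):

```python
def run(alpha, x=1.0, steps=10):
    for _ in range(steps):
        x = x - alpha * 2 * x  # f(x) = x**2, f'(x) = 2*x
    return x

print(run(alpha=0.01))  # too small: ~0.82 after 10 steps (slow)
print(run(alpha=1.10))  # too large: ~6.19, sign flips each step (diverging)
```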
What is the Newton-Raphson Method?
The Newton-Raphson Method is a fast, iterative algorithm used to find:
1. The roots (zeros) of a function (where f(x)=0), or
2. The optima of a function (where f′(x)=0)
How do you find roots with the Newton-Raphson Method?
xₙ₊₁ = xₙ − f(xₙ) / f′(xₙ)
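A minimal root-finding sketch (the tolerance and example are assumptions):

```python
def newton_root(f, f_prime, x, eps=1e-10, max_iters=100):
    """Find x with f(x) = 0 via x <- x - f(x) / f'(x)."""
    for _ in range(max_iters):
        fx = f(x)
        if abs(fx) < eps:
            break
        x = x - fx / f_prime(x)
    return x

# Root of f(x) = x**2 - 2 is sqrt(2) ~ 1.41421356.
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x=1.0))
```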
How is Newton-Raphson used for Optimisation?
You apply the root-finding formula to the derivative, since optima occur where f′(x) = 0:
xₙ₊₁ = xₙ − f′(xₙ) / f′′(xₙ)
Newton-Raphson Algorithm
REFER TO SLIDES
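A hedged sketch of the optimisation variant (the slides' breakdown may differ; names and the example are assumptions):

```python
def newton_optimise(f_prime, f_second, x, eps=1e-10, max_iters=100):
    """Find a stationary point via x <- x - f'(x) / f''(x)."""
    for _ in range(max_iters):
        g = f_prime(x)
        if abs(g) < eps:
            break
        x = x - g / f_second(x)
    return x

# Minimum of f(x) = (x - 3)**2: f'(x) = 2*(x - 3), f''(x) = 2.
print(newton_optimise(lambda x: 2 * (x - 3), lambda x: 2.0, x=0.0))  # 3.0
```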
What is Smoothness?
Smoothness refers to how “nice” or well-behaved a function is, especially in terms of:
* Continuity (no jumps)
* Differentiability (has a slope)
* Second-order differentiability (has curvature)
Smoothness is classified by smoothness classes (C⁰, C¹, C², …) - REFER TO NOTES
Why is Smoothness important in Newton-Raphson?
It requires a function of class C²: the function is continuous, its first and second derivatives exist, and those derivatives are also continuous.
What are the limitations of Newton-Raphson?
Requires the second derivative - not all functions are twice differentiable
Can diverge - if the starting point is too far from the root, or the second derivative is too small or zero
Doesn't guarantee a global minimum - only finds a local optimum, which could be a min, max, or saddle point
What are the limitations of Gradient Ascent/Descent?
Needs careful choice of step size - too big and it overshoots, too small and convergence is slow
May get stuck in local minima
Slows down near the minimum - as the gradient becomes very small
What is a Local Optimum?
A local optimum is a point where the function is better than its immediate neighbours.
What is a Global Optimum?
A global optimum is the absolute best value across the entire domain.
What is the general understanding of Global Optimum?
- There is no general-purpose algorithm that can guarantee finding the global optimum in all cases.
- Although we can still find good solutions, even if we can’t guarantee the best one.
○ Why is this important?
* Because in the real world, we:
□ Rarely need perfect solutions
□ Often just want “good enough”, quickly
□ Work in messy domains with weird curves, noise, or constraints
What are the approaches used to find the good solutions?
- We use:
□ Gradient descent with restarts
□ Stochastic methods (e.g., simulated annealing, genetic algorithms)
□ Heuristics or metaheuristics (e.g., greedy strategies, random sampling)
- These don't guarantee global optima, but they explore the space better and often find excellent local optima.
Q: Why can’t gradient methods guarantee a global optimum?
Because they follow the gradient, which only leads to a local minimum or maximum. In non-convex functions, there may be multiple optima. Without exploring the entire space (which is often infinite or too large), there’s no way to know if a better one exists elsewhere.
What are the fundamental limitations in optimisation (especially when trying to find the global optimum)?
Non-Enumerable Domains - the domain is infinite; this is an issue because you can't test every possible value
Huge Search Spaces - so many possibilities that you cannot search them all in reasonable time; the search may take too long to be useful
What is Gradient Ascent with Restarts?
This method is an enhancement of basic gradient ascent. It helps address the problem of getting stuck in a local maximum.
- Instead of trusting just one run of gradient ascent, you try multiple random starts and keep track of the best solution you find.
- This increases your chances of getting closer to the global maximum.
Why use Gradient Ascent with Restarts?
- Basic gradient ascent is deterministic: it follows the same path from the same start point.
- If it starts near a local maximum, it may stop there and miss better peaks.
- Restarting from different points gives a better chance to explore more of the space.
Gradient Ascent with Restarts Algorithm
REFER TO NOTES
What is the Stopping Criteria for Gradient Ascent with Restarts?
- The inner loop stops when ∣∇f(x)∣ < ϵ → the slope is nearly zero (flat)
- The outer loop stops when:
○ You’ve run a fixed number of iterations
○ Or a time/resource limit is reached
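A minimal sketch of the restart loop with these stopping criteria (a hedged reconstruction; the notes' version may differ, and the search range, restart count, and example function are assumptions):

```python
import math
import random

def gradient_ascent_restarts(f, f_prime, n_restarts=10, alpha=0.1,
                             eps=1e-6, max_iters=1000, lo=-10.0, hi=10.0):
    best_x, best_val = None, float("-inf")
    for _ in range(n_restarts):          # outer loop: fixed budget
        x = random.uniform(lo, hi)       # random start
        for _ in range(max_iters):       # inner loop: plain gradient ascent
            g = f_prime(x)
            if abs(g) < eps:             # |f'(x)| < eps: nearly flat
                break
            x = x + alpha * g
        if f(x) > best_val:              # keep the best peak found so far
            best_x, best_val = x, f(x)
    return best_x, best_val

# Multimodal example: f(x) = sin(x) + 0.1*x has several local maxima.
print(gradient_ascent_restarts(lambda x: math.sin(x) + 0.1 * x,
                               lambda x: math.cos(x) + 0.1))
```

Note that the loop evaluates f(x) itself, not just f′(x), to compare runs; this is the extra requirement mentioned below.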
Q: Why does adding restarts improve Gradient Ascent?
Gradient ascent can get stuck in local maxima. By restarting from different random positions, the algorithm explores multiple regions of the search space. Comparing function values across all runs helps identify the best peak found, even if it’s not the global optimum.
Q: What extra requirement does gradient ascent with restarts introduce?
You must be able to compute the actual function value f(x) (not just its gradient) to compare and retain the best result.
How does this method (Gradient Ascent with Restarts) try to overcome the challenge of local optima?
Through multiple random initializations and tracking the best result, it reduces the chance of getting stuck at a poor-quality local maximum.
What are the Challenges in Optimisation?
Large Search Spaces
Local Optima Traps
Plateaus
Non-differentiable points
Valleys in High Dimensions
Large Search Spaces - Challenges in Optimisation
The space of possible solutions is too big.
- This is a problem because you may waste time, gradient methods may converge too slowly, and you can't check every point
Local Optima Traps - Challenges in Optimisation
A point looks like a minimum or maximum, but it is not the global best.
- This is a problem because GD and NR methods stop where the slope = 0, so they can get stuck in a local dip
Plateaus - Challenges in Optimisation
A flat region where the slope is zero or nearly zero everywhere.
- This is a problem because GD relies on the slope to know where to move, so it makes very slow or no progress
Non-differentiable points - Challenges in Optimisation
No derivative exists at the point.
- This is a problem because gradient-based methods need the function to be differentiable; the slope is undefined, which breaks the method
Valleys in High Dimensions - Challenges in Optimisation
Narrow, curved valleys where the gradient points away from the valley's direction, or changes quickly along one axis and slowly along another.
- This is a problem because each GD step follows only the local gradient, causing zig-zagging and slow, inefficient convergence