Topic 8: Intro to Gradient Descent Flashcards
What can we say about ε in GD
“target accuracy”
The number of gradient descent steps scales logarithmically with 1/ε (the inverse of the target accuracy)
O(log(1/ε)) steps, under assumptions such as strong convexity
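Worked illustration (a standard example with assumed details, not taken from the card: F(x) = x²/2 and a fixed step size η ∈ (0,1)): the update gives xt+1 = xt − η⋅xt = (1 − η)⋅xt, so xt = (1 − η)^t ⋅ x0. To reach |xt| ≤ ε we need t ≥ log(|x0|/ε) / log(1/(1 − η)), i.e. O(log(1/ε)) steps.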
Why is x0 ≠ 0 in GD
The update is xt+1 = xt − ηt ⋅ F′(xt), and in the worked example F′(x) is proportional to x, so every iterate is a constant multiple of x0.
If x0 = 0, every xt from then on is also 0 and the algorithm never moves.
Furthermore, if the algorithm starts at a point where F′(x0) = 0
(ie x0 is a critical point) then it would never move
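A tiny numerical sketch of this point (the choice F(x) = x²/2, so F′(x) = x, and the fixed step η = 0.5 are illustrative assumptions):

# Starting at x0 = 0: F'(0) = 0, so gradient descent never moves.
x = 0.0
for t in range(10):
    x = x - 0.5 * x   # xt+1 = xt - eta * F'(xt), with F'(x) = x
print(x)              # still 0.0

# Starting at x0 = 8.0: the iterates shrink geometrically towards the minimum at 0.
x = 8.0
for t in range(10):
    x = x - 0.5 * x
print(x)              # 8 * 0.5**10, close to 0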
Where do we want xt to tend
xt → x* (the minimiser) as t → ∞; in the worked example the minimiser is 0, so xt → 0
What is an objective function
Also called the loss function in machine learning
The function you want to minimise (or maximise)
Describe the GD algorithm steps on a differentiable function F
Initialise w1 and choose T > 0 (the number of steps)
set step sizes ηt > 0
for t = 1 to T:
wt+1 = wt − ηt ⋅ ∇F(wt)
output wT+1 (the final iterate)
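A minimal Python sketch of these steps (the quadratic example objective and the fixed step size eta are my illustrative choices, not part of the card):

# Gradient descent on a differentiable F, given only its gradient grad_F.
def gradient_descent(grad_F, w1, T, eta=0.1):
    w = w1
    for t in range(T):              # for t = 1 to T
        w = w - eta * grad_F(w)     # wt+1 = wt - eta_t * gradient of F at wt
    return w                        # output the final iterate

# Example: F(w) = w**2 has gradient 2*w and minimiser w* = 0.
print(gradient_descent(lambda w: 2 * w, w1=5.0, T=100))  # prints a value close to 0.0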
what is a differentiable function
A function that has a well-defined derivative at every point in its domain
What properties does a function need to have for gd
Differentiable
convexity and Lipschitzness are ideal
existence of at least one minimum
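For example, F(x) = x² satisfies all of these: it is differentiable everywhere, convex, its derivative F′(x) = 2x is Lipschitz continuous, and it attains a minimum at x = 0. By contrast, F(x) = |x| fails the first requirement because it is not differentiable at x = 0.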
why does gradient descent converge faster on strongly convex functions compared to non-convex ones
strong convexity:
provides a unique global minimum
provides a convergence-rate guarantee (a geometric rate) for gradient descent
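For example (a standard result, under the assumptions that F is m-strongly convex and L-smooth and the step size is fixed at ηt = 1/L): F(wt+1) − F(w*) ≤ (1 − m/L)^t ⋅ (F(w1) − F(w*)), a geometric (linear) rate, which is what gives the O(log(1/ε)) step count. Without convexity, gradient descent is in general only guaranteed to approach a stationary point, and typically at a slower rate.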
what is k in gd
k is a constant parameter chosen from the interval (0,1)
It’s a tuning parameter that lets you control the step length
it is not present in every formulation of gradient descent
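For instance (an illustrative convention only, since the exact role of k depends on the formulation being used): if ηmax is the largest 'safe' step size for the problem, one might set ηt = k ⋅ ηmax, so choosing k closer to 1 takes larger steps and choosing it closer to 0 takes more cautious ones.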
will gd find local minima for non-convex, non-Lipschitz functions
Yes
as is often the case in real-world problems, functions are neither convex nor Lipschitz
a (local) minimum can still be found through careful choice of the step size and other algorithm parameters
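A sketch of this in Python (the non-convex function F(w) = w⁴ − 3w² + w, the starting points, and the step size are my illustrative choices): with a small enough step size, gradient descent settles into whichever local minimum is nearest the starting point.

# F(w) = w**4 - 3*w**2 + w is non-convex and has two local minima.
def grad_F(w):
    return 4 * w**3 - 6 * w + 1

def gd(w0, eta=0.01, T=500):
    w = w0
    for t in range(T):
        w = w - eta * grad_F(w)
    return w

print(gd(2.0))    # settles in the local minimum near w ≈ 1.1
print(gd(-2.0))   # a different start lands in the other basin, near w ≈ -1.3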