lecture 8: optimisation and gradient descent Flashcards
in general, an objective function can be described by
cost function = loss function (evaluated on the learning model) + regularisation
this objective as a whole, plus the optimisation routine used to minimise it, makes up the building blocks of ML algorithms
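as a rough sketch (my own example, not from the lecture), a ridge-regression style objective combines a squared-error loss on the learning model's predictions with an L2 regularisation term on w:

import numpy as np

def cost(w, X, y, lam=0.1):
    # cost = loss(learning model on the data) + regularisation
    residuals = X @ w - y            # learning model: linear predictions Xw
    loss = np.mean(residuals ** 2)   # squared-error loss
    reg = lam * np.sum(w ** 2)       # L2 regularisation, weight lam is an assumed value
    return loss + reg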
what is gradient descent
iteratively moving the weights w in the direction of the negative gradient (i.e. the direction in which the cost decreases), with step size set by the learning rate eta (𝜂), until some convergence criterion is reached, giving an estimate of a local minimum
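a minimal sketch of the update rule w ← w - 𝜂∇C(w), assuming the gradient function grad_C is known and the number of steps k is fixed:

import numpy as np

def gradient_descent(w0, grad_C, eta=0.1, k=100):
    # repeatedly step against the gradient: w <- w - eta * grad_C(w)
    w = np.asarray(w0, dtype=float)
    for _ in range(k):
        w = w - eta * grad_C(w)
    return w

# example: minimise C(w) = (w - 3)^2, whose gradient is 2(w - 3)
w_star = gradient_descent(w0=[0.0], grad_C=lambda w: 2 * (w - 3))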
what are some possible convergence criteria
- set a maximum number of iterations k
- check whether the percentage/absolute change in C is below a threshold
- check whether the percentage/absolute change in w is below a threshold
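a hedged sketch of how these checks might look in code (the thresholds tol_C and tol_w are assumed names, not from the lecture):

import numpy as np

def gradient_descent_with_stopping(w0, C, grad_C, eta=0.1, k=1000, tol_C=1e-6, tol_w=1e-6):
    # stop on max iterations, or when the change in C or in w becomes tiny
    w = np.asarray(w0, dtype=float)
    for _ in range(k):                          # criterion 1: maximum iterations k
        w_new = w - eta * grad_C(w)
        if abs(C(w_new) - C(w)) < tol_C:        # criterion 2: change in C below threshold
            return w_new
        if np.linalg.norm(w_new - w) < tol_w:   # criterion 3: change in w below threshold
            return w_new
        w = w_new
    return w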
why can gradient descent only find a local minimum?
because the gradient is 0 at a local minimum, so the update step vanishes and w cannot change after that; nothing pushes w back out towards a possibly better (global) minimum
what does increasing the learning rate do
it increases the rate at which w converges; however, too high an eta can cause the updates to repeatedly overshoot the local minimum
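a small illustration (my own example, not from the lecture): on C(w) = w² the gradient step is w ← w - 𝜂·2w, so any 𝜂 > 1 makes |w| grow and each step overshoots further:

def step(w, eta):
    return w - eta * 2 * w   # gradient of C(w) = w^2 is 2w

w_small, w_large = 5.0, 5.0
for _ in range(5):
    w_small = step(w_small, eta=0.1)   # shrinks towards the minimum at 0
    w_large = step(w_large, eta=1.1)   # overshoots and diverges
print(w_small, w_large)                # roughly 1.64 vs -12.44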
what is the purpose of loss functions
different loss functions encode the penalty for a prediction f(xᵢ, w) when the true value is yᵢ
what is binary loss function
it is based on the sign of the product of f(xᵢ, w) and yᵢ:
if the sign is positive, the classification is correct and the loss is 0; if the sign is negative, the classification is wrong and the loss is 1
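a minimal sketch of the binary (0/1) loss as described above:

def binary_loss(prediction, y):
    # 0/1 loss: 0 if the sign of f(x, w) matches y, else 1
    return 0.0 if prediction * y > 0 else 1.0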
binary loss is not differentiable, what are 2 other possible loss functions to use?
hinge loss and exponential loss
hinge loss = max(0, 1 - yᵢ f(xᵢ, w))
exponential loss = exp(-yᵢ f(xᵢ, w))
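a minimal sketch of both, written as functions of the same product (margin) m = yᵢ f(xᵢ, w) used in the binary loss:

import numpy as np

def hinge_loss(margin):
    return np.maximum(0.0, 1.0 - margin)   # max(0, 1 - y*f(x, w))

def exponential_loss(margin):
    return np.exp(-margin)                 # exp(-y*f(x, w))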
what does the sigmoid function do
used for classification; it maps pᵢᵀw (which can be anywhere between -∞ and ∞) to a value between 0 and 1
f(x,w) = σ(pᵢᵀw)
σ(a) = 1/(1 + e⁻ᵃ)
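a minimal sketch of the sigmoid and the resulting classifier output, assuming p_i holds the features for example i:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))   # maps (-inf, inf) to (0, 1)

def f(p_i, w):
    return sigmoid(p_i @ w)           # f(x, w) = sigma(p_i^T w)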