lecture 8: optimisation and gradient descent Flashcards

1
Q

in general, an objective function can be described by

A

cost function = loss function (applied to the learning model's predictions) + regularisation
this objective, together with the optimisation routine used to minimise it, makes up the building blocks of ML algorithms
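as a formula (a sketch in my own notation; λ for the regularisation strength and R for the regulariser are assumed symbols, not necessarily the lecture's):

    C(w) = \sum_{i=1}^{N} L(f(x_i, w), y_i) + \lambda R(w)

i.e. the summed loss of the model's predictions over the training data, plus a regularisation term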

2
Q

what is gradient descent

A

iteratively updating the weights w in the direction of the negative gradient (downhill on the cost surface), scaled by some learning rate eta (η), until some convergence criterion is reached; this yields an estimate of a local minimum
update rule: w ← w − η∇C(w)
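a minimal sketch in Python, assuming the gradient of the cost is available as a callable (function and parameter names are illustrative, not from the lecture):

    import numpy as np

    def gradient_descent(grad_C, w0, eta=0.1, max_iter=1000, tol=1e-6):
        # start from an initial guess w0
        w = np.asarray(w0, dtype=float)
        for _ in range(max_iter):
            step = eta * grad_C(w)          # scaled step against the gradient
            w = w - step
            if np.linalg.norm(step) < tol:  # stop once updates become tiny
                break
        return w

    # example: minimise C(w) = (w - 3)^2, whose gradient is 2(w - 3)
    print(gradient_descent(lambda w: 2 * (w - 3), w0=[0.0]))  # ≈ [3.]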

3
Q

what are some possible convergence criteria

A
  1. set a maximum number of iterations k
  2. check whether the percentage/absolute change in C is below a threshold
  3. check whether the percentage/absolute change in w is below a threshold (all three checks are sketched in code below)
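a hedged sketch of how these three checks might look in code (all names and thresholds are illustrative):

    import numpy as np

    def converged(k, max_iter, C_new, C_old, w_new, w_old, tol=1e-6):
        if k >= max_iter:                        # criterion 1: iteration cap k
            return True
        if abs(C_new - C_old) < tol:             # criterion 2: change in C below threshold
            return True
        if np.linalg.norm(w_new - w_old) < tol:  # criterion 3: change in w below threshold
            return True
        return False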
4
Q

why can gradient descent only find a local minimum?

A

because the gradient is 0 at a local minimum, so the update step −η∇C vanishes and w cannot change after that; the algorithm only ever follows the local downhill direction, so it has no way to reach a deeper minimum elsewhere

5
Q

what does increasing the learning rate do

A

it increases the rate at which w converges; however, too high an η causes the updates to repeatedly overshoot the local minimum, and can even make them diverge
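a small worked example (my own illustration, not from the lecture): for the one-dimensional cost C(w) = w², the update is

    w_{k+1} = w_k - \eta C'(w_k) = w_k - 2\eta w_k = (1 - 2\eta) w_k

so the iterates shrink towards the minimum at 0 only when |1 − 2η| < 1, i.e. 0 < η < 1; for η > 1 each step overshoots further than the last and w diverges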

6
Q

what is the purpose of loss functions

A

different loss functions encode the penalty for a prediction f(xᵢ, w) when the true value is yᵢ

7
Q

what is the binary loss function

A

it is based on the product of f(x, w) and yᵢ (the margin)
if the sign of the product is positive, the classification is correct and the loss is 0; if the sign is negative, the classification is wrong and the loss is 1

8
Q

binary loss is not differentiable; what are 2 other possible loss functions to use?

A

hinge loss and exponential loss, where prediction = yᵢ f(x, w), the margin from the previous card
hinge loss = max(0, 1 − prediction)
exponential loss = exp(−prediction)
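a quick sketch comparing the three losses as functions of the margin m = yᵢ f(x, w) (my own illustration, not from the lecture):

    import numpy as np

    def binary_loss(m):
        # 0-1 loss: 0 for a positive margin (correct), 1 otherwise
        return np.where(m > 0, 0.0, 1.0)

    def hinge_loss(m):
        # penalises any margin below 1; differentiable except at m = 1
        return np.maximum(0.0, 1.0 - m)

    def exp_loss(m):
        # smooth everywhere; penalises confident mistakes very heavily
        return np.exp(-m)

    margins = np.array([-2.0, -0.5, 0.5, 2.0])
    for loss in (binary_loss, hinge_loss, exp_loss):
        print(loss.__name__, loss(margins))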

9
Q

what does the sigmoid function do

A

used for classification; maps pᵢᵀw (which lies between −∞ and ∞) to a value between 0 and 1
f(x, w) = σ(pᵢᵀw)
σ(a) = 1/(1 + e⁻ᵃ)
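a minimal sketch, assuming pᵢᵀw is an ordinary dot product (the example numbers are made up):

    import numpy as np

    def sigmoid(a):
        # squashes any real number into the interval (0, 1)
        return 1.0 / (1.0 + np.exp(-a))

    p_i = np.array([1.0, 2.0, -0.5])   # illustrative feature vector
    w = np.array([0.3, -0.1, 0.8])     # illustrative weights

    score = p_i @ w        # anywhere in (-inf, inf)
    prob = sigmoid(score)  # in (0, 1), interpretable as P(y = 1 | x)
    print(score, prob)     # -0.3, ≈ 0.426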
