Chapter 6: Optimisation Flashcards

1
Q

we want to optimise…

A

the loss function

2
Q

what does optimisation theory say about the optimum

A

the partial derivative of the objective function with respect to each input variable should be 0 (a stationary point)
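
A minimal worked illustration (my own toy objective, not from the deck): with two input variables, set both partial derivatives to zero and solve.

    \begin{aligned}
    O(w_1, w_2) &= (w_1 - 1)^2 + (w_2 + 2)^2 \\
    \frac{\partial O}{\partial w_1} &= 2(w_1 - 1) = 0 \;\Rightarrow\; w_1 = 1 \\
    \frac{\partial O}{\partial w_2} &= 2(w_2 + 2) = 0 \;\Rightarrow\; w_2 = -2
    \end{aligned}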

3
Q

∇O(w) =

A
∂O(w) / ∂w for each weight w, i.e. the vector of partial derivatives of O(w) with respect to every weight

4
Q

How do we optimise the function O(w) to find w? Give the steps.

A

take the derivative (gradient) of O(w)
set it to 0
rearrange to find w (see the sketch below)
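
A minimal sketch of these steps in Python using sympy (my own toy objective; the function below is illustrative, not from the deck):

    import sympy as sp

    # Toy objective (illustrative only): O(w) = (w - 3)**2 + 1
    w = sp.symbols('w')
    O = (w - 3)**2 + 1

    dO = sp.diff(O, w)       # step 1: take the derivative of O(w)
    opt = sp.solve(dO, w)    # steps 2-3: set the derivative to 0 and rearrange for w
    print(dO, opt)           # prints: 2*w - 6 [3]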

5
Q

give the model equation for MAIP (Y)

A

Y = W^T Xtilde

6
Q

give the least squares equation for O(w)

A

O(w) = 1/2 sum_i (Y_i - y_i)^2, where Y_i is the model prediction and y_i the target
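
The same objective in matrix form (my own rewriting, assuming the rows of X̃ are the augmented inputs x̃_i and y stacks the targets):

    O(w) \;=\; \tfrac{1}{2}\sum_i \bigl(w^{\mathsf T}\tilde{x}_i - y_i\bigr)^2 \;=\; \tfrac{1}{2}\,\lVert \tilde{X}w - y \rVert^2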

7
Q

give ∇O(w) for least squares

A

∇O(w) = sum_i (Y_i - y_i) Xtilde_i = Xtilde^T (Xtilde w - y)
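
A short derivation sketch (my own, using the objective from the previous card): setting this gradient to zero gives the normal equation, and hence the pseudo-inverse solution quoted in the next card.

    \begin{aligned}
    \nabla O(w) \;=\; \tilde{X}^{\mathsf T}(\tilde{X}w - y) &= 0 \\
    \tilde{X}^{\mathsf T}\tilde{X}\, w &= \tilde{X}^{\mathsf T} y \\
    w &= \bigl(\tilde{X}^{\mathsf T}\tilde{X}\bigr)^{-1}\tilde{X}^{\mathsf T} y \;=\; \tilde{X}^{+} y
    \end{aligned}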

8
Q

give the weights in the normal equation for least squares

A

W = Xtilde+ Y (Xtilde+ is the Moore-Penrose pseudo-inverse of Xtilde)
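
A minimal numpy sketch (my own illustration; the data and variable names are made up) of computing the least squares weights via the pseudo-inverse:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))
    X_tilde = np.hstack([X, np.ones((5, 1))])   # augmented inputs (extra bias column)
    y = X_tilde @ np.array([2.0, -1.0, 0.5])    # targets generated from known weights

    w = np.linalg.pinv(X_tilde) @ y             # W = Xtilde+ Y
    print(w)                                    # recovers approximately [2, -1, 0.5]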

9
Q

Give the L2-regularised least squares equation for the weights

A

W = (Xtilde^T Xtilde + lambda * I)^-1 Xtilde^T Y
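
A minimal numpy sketch of the same closed-form solution (my own illustration; the helper name ridge_weights and its arguments are assumptions, not from the deck):

    import numpy as np

    def ridge_weights(X_tilde, y, lam):
        # Closed form: W = (Xtilde^T Xtilde + lambda * I)^-1 Xtilde^T Y
        d = X_tilde.shape[1]
        return np.linalg.solve(X_tilde.T @ X_tilde + lam * np.eye(d), X_tilde.T @ y)

    # e.g. w = ridge_weights(X_tilde, y, lam=0.1)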

10
Q

what do we add to regularise the normal equation for the least squares model

A

lambda * I (the identity matrix), added to Xtilde^T Xtilde before inverting

11
Q

what is gradient descent

A

iteratively apply changes to the parameters in the direction of the negative gradient, so as to minimise the objective function

12
Q

what do we use to optimise non-linear models

A

gradient descent

13
Q

how is the ‘change’ in gradient descent defined

A

w(t+1) = w(t) + change(w(t))

change(w) = - learning_rate * ∇O(w)
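
A minimal numpy sketch of this update rule (my own illustration; the data, learning rate, and iteration count are assumptions), using the least squares gradient from card 7:

    import numpy as np

    rng = np.random.default_rng(0)
    X_tilde = np.hstack([rng.normal(size=(50, 2)), np.ones((50, 1))])
    y = X_tilde @ np.array([2.0, -1.0, 0.5])

    w = np.zeros(3)
    learning_rate = 0.01
    for _ in range(500):
        grad = X_tilde.T @ (X_tilde @ w - y)   # gradient of the least squares objective
        w = w - learning_rate * grad           # w(t+1) = w(t) + change(w(t)), change = -lr * grad
    print(w)                                   # approaches [2, -1, 0.5]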

14
Q

what is the learning rate

A

determines how large each change is relative to the gradient

determines how many iterations are needed before the gradient reaches 0

15
Q

what happens if the learning rate is too low or too high

A

either way it takes an unnecessarily large number of iterations: too low means tiny steps toward the minimum, too high means overshooting and oscillating around it (or even diverging)

16
Q

what is sequential (online) training

A

update model parameters after just one training sample

17
Q

what is stochastic gradient descent

what does it avoid

A

updates the parameters based on just one training sample at a time

avoids overfitting

18
Q

what is mini-batch gradient descent

A

estimates the gradient of the error using just a small set of training samples
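
A minimal numpy sketch (my own illustration; the batch size, learning rate, and iteration count are assumptions) where the gradient is estimated from a small random batch each step:

    import numpy as np

    rng = np.random.default_rng(0)
    X_tilde = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
    y = X_tilde @ np.array([2.0, -1.0, 0.5])

    w = np.zeros(3)
    for _ in range(2000):
        idx = rng.choice(len(y), size=16, replace=False)   # pick a small random mini-batch
        Xb, yb = X_tilde[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / len(yb)              # gradient estimated on the batch only
        w = w - 0.05 * grad
    print(w)                                               # approaches [2, -1, 0.5]

With a batch size of 1 this reduces to the stochastic gradient descent described in card 17.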

19
Q

compare gradient descent and stochastic gradient descent

A

GD: faster (fewer iterations), but each update requires all the training data; the computation can be parallelised
SGD: more iterations, each update uses only one training sample; it is not easily parallelised