Chapter 6: Optimisation Flashcards
we want to optimise…
the loss (objective) function O(w)
what is the optimisation criterion
each derivative of the objective function with respect to each of its input variables (the weights w) should be 0
∇O(w) = ∂O(w)/∂w for each w
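In vector form (a standard restatement of the two cards above, assuming D weights for illustration), the gradient is the vector of partial derivatives and it must vanish at the optimum:

```latex
\nabla O(\mathbf{w}) =
\left(
  \frac{\partial O(\mathbf{w})}{\partial w_1},\;
  \dots,\;
  \frac{\partial O(\mathbf{w})}{\partial w_D}
\right)^{\top}
= \mathbf{0}
```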
How do we optimise the function O(w) to find w? Give the steps.
derivative of O(w)
set to 0
rearrange to find w
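A minimal sketch of these three steps on a made-up 1-D objective using SymPy; the objective (3w − 6)²/2 is purely illustrative, not from the cards:

```python
import sympy as sp

w = sp.symbols('w')
# Toy objective: O(w) = (3*w - 6)**2 / 2  (a simple 1-D least-squares-style loss)
O = (3 * w - 6) ** 2 / 2

dO_dw = sp.diff(O, w)               # step 1: derivative of O(w)
stationary = sp.Eq(dO_dw, 0)        # step 2: set it to 0
solution = sp.solve(stationary, w)  # step 3: rearrange/solve for w

print(dO_dw)     # 9*w - 18
print(solution)  # [2]
```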
give the model equation for the prediction Y
Y = W^T X̃
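A tiny numpy sketch of the prediction Y = W^T X̃, assuming (a common convention not stated on the card) that X̃ is the input augmented with a constant 1 so the bias is absorbed into W; all numbers are made up:

```python
import numpy as np

# Assumption: X̃ is the input vector augmented with a constant 1, so the
# bias term is absorbed into the weight vector W (bias-first convention here).
x = np.array([0.5, 2.0])               # raw input features (made-up values)
x_tilde = np.concatenate(([1.0], x))   # augmented input [1, x1, x2]

W = np.array([0.1, 0.4, -0.3])         # weights, bias first (made-up values)

Y = W.T @ x_tilde                      # prediction Y = W^T X̃
print(Y)                               # 0.1 + 0.4*0.5 - 0.3*2.0 = -0.3
```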
give the least squares equation for O(w)
O(w) = ½ Σ_n (Y_n − y_n)²
give ∇O(w) for least squares
∇O(w) = X̃^T (X̃W − Y); setting this to 0 and solving gives W = X̃⁺Y
give the weights in the normal equation for least squares
W = X̃⁺Y, where X̃⁺ = (X̃^T X̃)^-1 X̃^T is the pseudoinverse of X̃
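A numpy sketch tying the least-squares cards together: the ½Σ(Y − y)² loss, its gradient X̃^T(X̃W − Y), and the pseudoinverse solution W = X̃⁺Y. The synthetic data and true_w values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: targets from a known linear model plus a little noise.
N = 100
X_tilde = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])  # bias column + features
true_w = np.array([1.0, 2.0, -0.5])
y = X_tilde @ true_w + 0.1 * rng.normal(size=N)

def loss(W):
    # O(w) = 1/2 * sum_n (Y_n - y_n)^2, with predictions Y = X̃ W
    return 0.5 * np.sum((X_tilde @ W - y) ** 2)

def grad(W):
    # ∇O(w) = X̃^T (X̃ W - y)
    return X_tilde.T @ (X_tilde @ W - y)

# Normal-equation solution via the pseudoinverse: W = X̃⁺ y
W_hat = np.linalg.pinv(X_tilde) @ y

print(W_hat)          # close to true_w
print(loss(W_hat))    # small residual loss
print(grad(W_hat))    # ~0: the gradient vanishes at the minimum
```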
Give the L2-regularised least squares model equation
W = (X̃^T X̃ + λI)^-1 X̃^T Y
what do we add to regularise the normal equation for the least squares model
λI (λ times the identity matrix), added to X̃^T X̃ before inverting
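A sketch of the regularised closed form W = (X̃^T X̃ + λI)^-1 X̃^T Y; the data is random and λ = 0.1 is an arbitrary illustrative value. Note this simple version also regularises the bias weight, which some treatments exclude:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 3
X_tilde = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])
y = rng.normal(size=N)

lam = 0.1                          # regularisation strength λ (illustrative)
I = np.eye(X_tilde.shape[1])

# L2-regularised least squares: W = (X̃^T X̃ + λ I)^-1 X̃^T y
W_ridge = np.linalg.solve(X_tilde.T @ X_tilde + lam * I, X_tilde.T @ y)
print(W_ridge)
```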
what is gradient descent
iteratively apply a change to the parameters in the direction of the negative gradient, to minimise the objective function
what do we use to optimise non-linear models
gradient descent
how is ‘change’ defined
w_{t+1} = w_t + Δw_t
Δw_t = −η ∇O(w_t)  (η is the learning rate)
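A minimal gradient-descent loop implementing w_{t+1} = w_t − η∇O(w_t) on the least-squares objective; the learning rate η = 0.001, the 500 iterations and the data are illustrative choices, not values from the cards:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
X_tilde = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
y = X_tilde @ np.array([0.5, -1.0, 2.0]) + 0.05 * rng.normal(size=N)

def grad(w):
    # ∇O(w) = X̃^T (X̃ w - y), the gradient of the least-squares objective
    return X_tilde.T @ (X_tilde @ w - y)

eta = 0.001          # learning rate η (illustrative)
w = np.zeros(3)      # initial weights w_0

for t in range(500):
    w = w - eta * grad(w)   # w_{t+1} = w_t + Δw_t, with Δw_t = -η ∇O(w_t)

print(w)             # approaches the normal-equation solution X̃⁺ y
```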
what is the learning rate
determines how large a step we take relative to the gradient
determines how many iterations are needed before the gradient reaches 0 (i.e. before we converge)
what happens if the learning rate is too low or too high
too low: it will take an unnecessary number of iterations to converge; too high: the updates overshoot the minimum and can oscillate or diverge
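A 1-D toy showing both failure modes on O(w) = w²/2 (whose gradient is simply w); the specific η values are arbitrary:

```python
def run(eta, steps=20, w=1.0):
    # Gradient descent on O(w) = w^2 / 2, whose gradient is simply w.
    for _ in range(steps):
        w = w - eta * w
    return w

print(run(eta=0.01))  # too low: w is still ~0.82 after 20 steps (slow progress)
print(run(eta=0.5))   # sensible: w is ~1e-6, effectively converged
print(run(eta=2.5))   # too high: |w| grows every step and diverges
```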
what is sequential (online) training
update the model parameters after each individual training sample (one sample at a time)
what is stochastic gradient descent
what does it avoid
updates the parameters based on just one randomly chosen training sample at a time
the noisy single-sample updates help avoid overfitting
what is mini batch gradient descent
estimates the gradient of the error using just a small set of training samples
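A sketch contrasting a single-sample (SGD) step with a mini-batch step on the least-squares objective; the batch size of 32, the learning rates and the synthetic data are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500
X_tilde = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
y = X_tilde @ np.array([1.0, 0.5, -2.0]) + 0.1 * rng.normal(size=N)

def sgd_step(w, eta):
    # Stochastic (online) step: gradient from one randomly chosen sample.
    n = rng.integers(N)
    x_n = X_tilde[n]
    return w - eta * x_n * (x_n @ w - y[n])

def minibatch_step(w, eta, batch_size=32):
    # Mini-batch step: gradient estimated from a small random subset of samples.
    idx = rng.choice(N, size=batch_size, replace=False)
    X_b, y_b = X_tilde[idx], y[idx]
    return w - eta * X_b.T @ (X_b @ w - y_b) / batch_size

w_sgd = np.zeros(3)
w_mb = np.zeros(3)
for t in range(3000):
    w_sgd = sgd_step(w_sgd, eta=0.01)
    w_mb = minibatch_step(w_mb, eta=0.05)

print(w_sgd)  # noisy estimate near [1.0, 0.5, -2.0]
print(w_mb)   # smoother estimate near [1.0, 0.5, -2.0]
```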
compare gradient descent and stochastic gradient descent
GD: converges in fewer iterations, but each update uses the full dataset (more data per step); the per-update computation can be parallelised. SGD: needs more iterations, each update uses only one training sample, and the sequential updates are harder to parallelise.