Chapter 6: Optimisation Flashcards
we want to optimise…
the loss (objective) function O(w)
what is the optimisation criterion
each derivative of the objective function with respect to each of its input variables (the weights w) should be 0
∇O(w) = ∂O(w)/∂w for each w
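In vector form (a standard restatement of the two cards above, assuming D weights for illustration), the gradient is the vector of partial derivatives and it must vanish at the optimum:

```latex
\nabla O(\mathbf{w}) =
\left(
  \frac{\partial O(\mathbf{w})}{\partial w_1},\;
  \dots,\;
  \frac{\partial O(\mathbf{w})}{\partial w_D}
\right)^{\top}
= \mathbf{0}
```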
How do we optimise the function O(w) to find w? Give the steps.
derivative of O(w)
set to 0
rearrange to find w
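A minimal sketch of these three steps on a made-up 1-D objective using SymPy; the objective (3w − 6)²/2 is purely illustrative, not from the cards:

```python
import sympy as sp

w = sp.symbols('w')
# Toy objective: O(w) = (3*w - 6)**2 / 2  (a simple 1-D least-squares-style loss)
O = (3 * w - 6) ** 2 / 2

dO_dw = sp.diff(O, w)               # step 1: derivative of O(w)
stationary = sp.Eq(dO_dw, 0)        # step 2: set it to 0
solution = sp.solve(stationary, w)  # step 3: rearrange/solve for w

print(dO_dw)     # 9*w - 18
print(solution)  # [2]
```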
give the model equation for the prediction Y
Y = W^T X̃
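A tiny numpy sketch of the prediction Y = W^T X̃, assuming (a common convention not stated on the card) that X̃ is the input augmented with a constant 1 so the bias is absorbed into W; all numbers are made up:

```python
import numpy as np

# Assumption: X̃ is the input vector augmented with a constant 1, so the
# bias term is absorbed into the weight vector W (bias-first convention here).
x = np.array([0.5, 2.0])               # raw input features (made-up values)
x_tilde = np.concatenate(([1.0], x))   # augmented input [1, x1, x2]

W = np.array([0.1, 0.4, -0.3])         # weights, bias first (made-up values)

Y = W.T @ x_tilde                      # prediction Y = W^T X̃
print(Y)                               # 0.1 + 0.4*0.5 - 0.3*2.0 = -0.3
```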
give the least squares equation for O(w)
O(w) = ½ Σ_n (Y_n − y_n)²
give ∇O(w) for least squares
∇O(w) = X̃^T (X̃W − Y); setting this to 0 and solving gives W = X̃⁺Y
give the weights in the normal equation for least squares
W = X̃⁺Y, where X̃⁺ = (X̃^T X̃)^-1 X̃^T is the pseudoinverse of X̃
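A numpy sketch tying the least-squares cards together: the ½Σ(Y − y)² loss, its gradient X̃^T(X̃W − Y), and the pseudoinverse solution W = X̃⁺Y. The synthetic data and true_w values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: targets from a known linear model plus a little noise.
N = 100
X_tilde = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])  # bias column + features
true_w = np.array([1.0, 2.0, -0.5])
y = X_tilde @ true_w + 0.1 * rng.normal(size=N)

def loss(W):
    # O(w) = 1/2 * sum_n (Y_n - y_n)^2, with predictions Y = X̃ W
    return 0.5 * np.sum((X_tilde @ W - y) ** 2)

def grad(W):
    # ∇O(w) = X̃^T (X̃ W - y)
    return X_tilde.T @ (X_tilde @ W - y)

# Normal-equation solution via the pseudoinverse: W = X̃⁺ y
W_hat = np.linalg.pinv(X_tilde) @ y

print(W_hat)          # close to true_w
print(loss(W_hat))    # small residual loss
print(grad(W_hat))    # ~0: the gradient vanishes at the minimum
```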
Give the L2-regularised least squares model equation
W = (X̃^T X̃ + λI)^-1 X̃^T Y
what do we add to regularise the normal equation for the least squares model
λI (λ times the identity matrix), added to X̃^T X̃ before inverting
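A sketch of the regularised closed form W = (X̃^T X̃ + λI)^-1 X̃^T Y; the data is random and λ = 0.1 is an arbitrary illustrative value. Note this simple version also regularises the bias weight, which some treatments exclude:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 3
X_tilde = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])
y = rng.normal(size=N)

lam = 0.1                          # regularisation strength λ (illustrative)
I = np.eye(X_tilde.shape[1])

# L2-regularised least squares: W = (X̃^T X̃ + λ I)^-1 X̃^T y
W_ridge = np.linalg.solve(X_tilde.T @ X_tilde + lam * I, X_tilde.T @ y)
print(W_ridge)
```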
what is gradient descent
iteratively apply a change to the parameters in the direction of the negative gradient, to minimise the objective function
what do we use to optimise non-linear models
gradient descent
how is ‘change’ defined
w_{t+1} = w_t + Δw_t
Δw_t = −η ∇O(w_t)  (η is the learning rate)
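A minimal gradient-descent loop implementing w_{t+1} = w_t − η∇O(w_t) on the least-squares objective; the learning rate η = 0.001, the 500 iterations and the data are illustrative choices, not values from the cards:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
X_tilde = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
y = X_tilde @ np.array([0.5, -1.0, 2.0]) + 0.05 * rng.normal(size=N)

def grad(w):
    # ∇O(w) = X̃^T (X̃ w - y), the gradient of the least-squares objective
    return X_tilde.T @ (X_tilde @ w - y)

eta = 0.001          # learning rate η (illustrative)
w = np.zeros(3)      # initial weights w_0

for t in range(500):
    w = w - eta * grad(w)   # w_{t+1} = w_t + Δw_t, with Δw_t = -η ∇O(w_t)

print(w)             # approaches the normal-equation solution X̃⁺ y
```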
what is the learning rate
determines how large a step we take relative to the gradient
determines how many iterations are needed before the gradient reaches 0 (i.e. before we converge)
what happens if the learning rate is too low or too high
too low: it will take an unnecessary number of iterations to converge; too high: the updates overshoot the minimum and can oscillate or diverge
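A 1-D toy showing both failure modes on O(w) = w²/2 (whose gradient is simply w); the specific η values are arbitrary:

```python
def run(eta, steps=20, w=1.0):
    # Gradient descent on O(w) = w^2 / 2, whose gradient is simply w.
    for _ in range(steps):
        w = w - eta * w
    return w

print(run(eta=0.01))  # too low: w is still ~0.82 after 20 steps (slow progress)
print(run(eta=0.5))   # sensible: w is ~1e-6, effectively converged
print(run(eta=2.5))   # too high: |w| grows every step and diverges
```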
what is sequential (online) training
update the model parameters after each individual training sample (one sample at a time)
what is stochastic gradient descent
what does it avoid
updates the parameters based on just one randomly chosen training sample at a time
the noisy single-sample updates help avoid overfitting
what is mini batch gradient descent
estimates the gradient of the error using just a small set of training samples
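A sketch contrasting a single-sample (SGD) step with a mini-batch step on the least-squares objective; the batch size of 32, the learning rates and the synthetic data are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500
X_tilde = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
y = X_tilde @ np.array([1.0, 0.5, -2.0]) + 0.1 * rng.normal(size=N)

def sgd_step(w, eta):
    # Stochastic (online) step: gradient from one randomly chosen sample.
    n = rng.integers(N)
    x_n = X_tilde[n]
    return w - eta * x_n * (x_n @ w - y[n])

def minibatch_step(w, eta, batch_size=32):
    # Mini-batch step: gradient estimated from a small random subset of samples.
    idx = rng.choice(N, size=batch_size, replace=False)
    X_b, y_b = X_tilde[idx], y[idx]
    return w - eta * X_b.T @ (X_b @ w - y_b) / batch_size

w_sgd = np.zeros(3)
w_mb = np.zeros(3)
for t in range(3000):
    w_sgd = sgd_step(w_sgd, eta=0.01)
    w_mb = minibatch_step(w_mb, eta=0.05)

print(w_sgd)  # noisy estimate near [1.0, 0.5, -2.0]
print(w_mb)   # smoother estimate near [1.0, 0.5, -2.0]
```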
compare gradient descent and stochastic gradient descent
GD: converges in fewer iterations, but each update uses the full dataset (more data per step); the per-update computation can be parallelised. SGD: needs more iterations, each update uses only one training sample, and the sequential updates are harder to parallelise.