Perceptron Optimisation and Gradient Descent Flashcards
Explain local optimisation
If you graph the error against all values of a single weight, you can use the gradient of that curve to decide in which direction to change the weight to reduce the error.
Explain the zero order gradient descent method for optimisation
This method takes a step towards the goal without using the magnitude of the gradient.
Mathematically, you only know whether the slope goes up or down, not how steep it is, so you take an arbitrary step in the downhill direction.
Explain first order gradient descent method for optimisation
You know the gradient of the slope, so you take the step in the direction of steepest descent towards your goal.
How useful is zero order gradient descent? Why?
> Not effective
> Slow
What is the equation for first order gradient descent?
x_(t+1) = x_t - η∇f(x_t)
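As a sketch of this update rule in code: the function f(x) = x², the starting point, the learning rate, and the stopping tolerance below are all illustrative assumptions, not values from the flashcards.

```python
# First-order gradient descent on f(x) = x**2, whose gradient is 2*x.
# The function, start point, learning rate, and tolerance are all
# illustrative choices.

def gradient_descent(grad, x0, eta=0.1, tol=1e-6, max_steps=1000):
    """Apply x <- x - eta * grad(x) until the gradient is below tol."""
    x = x0
    for _ in range(max_steps):
        g = grad(x)
        if abs(g) < tol:      # stop when the gradient is tiny
            break
        x = x - eta * g
    return x

# Minimise f(x) = x**2 starting from x = 5.0; the true minimum is x = 0.
x_min = gradient_descent(lambda x: 2 * x, x0=5.0)
print(x_min)  # a value very close to 0
```

Each iteration multiplies x by (1 - 2η) here, so the iterate shrinks steadily towards the minimum.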
Explain the usefulness and benefits of second order gradient descent
> Extremely computationally intensive when you have a lot of variables
The benefit is that it takes a single (ideal) step in the right direction.
It is often easier to just take lots of small steps.
When do you stop with gradient descent?
When the magnitude of the gradient (or the change between successive steps) falls below a threshold
What is the learning rate?
This is the step size, η in the update equation: it controls how far each update moves along the gradient.
What is the effect of the size of the learning rate?
If the step is too large you can travel too far and overshoot the minimum error.
This can cause divergence.
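The overshoot effect can be sketched numerically. The quadratic f(x) = x² and the two learning rates below are assumptions chosen to make the behaviour visible:

```python
# Effect of the learning rate on f(x) = x**2 (gradient 2*x).
# Each update is x <- x - eta * 2x = x * (1 - 2*eta), so the iterate
# shrinks when |1 - 2*eta| < 1 and grows (diverges) when it exceeds 1.

def run(eta, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - eta * (2 * x)
    return x

small = run(eta=0.1)   # factor |1 - 0.2| = 0.8 per step -> converges
large = run(eta=1.1)   # factor |1 - 2.2| = 1.2 per step -> diverges
print(abs(small), abs(large))  # |small| is tiny, |large| has grown
```

With the large rate, every step jumps past the minimum to a point with even higher error, which is exactly the overshoot described above.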
What is the effect of local minima?
The error surface can have multiple minima, each with a different value. Gradient descent does not take this into account, so it may converge on a local minimum that is not the global (smallest) minimum.
What can we do to prevent local minima?
We can restart learning with random initial weights; each run may converge on a different minimum, and we keep the best result.
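A minimal sketch of random restarts, assuming an illustrative function f(x) = (x² - 1)², which has two minima, at x = -1 and x = +1:

```python
import random

# Random restarts on f(x) = (x**2 - 1)**2, whose gradient is
# 4*x**3 - 4*x. The function, learning rate, and restart range are
# made-up choices for illustration.

def descend(x, eta=0.05, steps=200):
    for _ in range(steps):
        x = x - eta * (4 * x**3 - 4 * x)
    return x

random.seed(0)
# Each restart begins from a different random initial value and may
# land in a different minimum.
results = [descend(random.uniform(-2, 2)) for _ in range(10)]
print(sorted(round(r, 3) for r in results))  # each close to -1 or +1
```

Starts on the negative side fall into the minimum at -1 and starts on the positive side into the one at +1, so restarting explores both.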
What is the issue with calculating gradient using E(X) = ∑ | y_n - t_n | ?
The equation (E(X) = ∑ | y_n - t_n |) is piecewise constant: even if we move the decision boundary closer to a correct classification, the error does not change unless a point changes class, so the improvement is invisible. The function is not differentiable, so it provides no useful gradient.
How do we measure error where the value is continuous even if the number of misclassified points does not change?
By measuring how close each point is to the decision boundary
What is the equation for calculating the distance of a misclassified point to the decision boundary?
> x = x_p + d × w/||w||
> d = w^T (x - x_p) / ||w||
> d = distance to the hyperplane
> w = vector perpendicular to the hyperplane
> x_p = point on the hyperplane
> x = misclassified point
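A sketch of this distance calculation, with made-up 2-D values for w, x_p, and x:

```python
import math

# Signed distance from a point x to the hyperplane through x_p with
# normal vector w, using d = w.(x - x_p) / ||w||.
# The vectors below are illustrative values only.

def distance_to_hyperplane(w, x_p, x):
    diff = [xi - pi for xi, pi in zip(x, x_p)]
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return sum(wi * di for wi, di in zip(w, diff)) / norm_w

w = [3.0, 4.0]    # perpendicular to the hyperplane, ||w|| = 5
x_p = [0.0, 0.0]  # a point on the hyperplane
x = [3.0, 4.0]    # a (misclassified) point off the hyperplane
print(distance_to_hyperplane(w, x_p, x))  # (9 + 16) / 5 = 5.0
```

The sign of d tells you which side of the hyperplane the point lies on, which is what the error measure below exploits.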
What is the new equation to calculate the total distance of all misclassified points?
E(X) = ∑ (w^T x_n + w_0), summed over the misclassified points x_n
What is the new function for the error that is differentiable?
E(X) = ∑ w^T x_n (y_n - t_n)
What is the equation of the gradient of the new error function?
∇E(X) = ∑ x_n (y_n - t_n)
What is the gradient descent function with the new measurement of error?
w_(k+1) = w_k - η ∑ x_n (y_n - t_n)
What is the issue of using this equation ( w_(k+1) = w_k - η ∑ x_n (y_n - t_n) ) for gradient descent?
It requires a lot of computation because every point's contribution must be calculated on each step.
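A sketch of one batch update with this rule. The step-function output y_n, the dataset, the initial weights, and η are all made-up illustrative choices:

```python
# One batch update of w_(k+1) = w_k - eta * sum_n x_n * (y_n - t_n),
# where y_n is the perceptron output for x_n and t_n is its target.
# All numbers below are illustrative assumptions.

def output(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def batch_update(w, X, t, eta=0.1):
    # Sum the gradient contribution of every point: costly for large N,
    # because each point's output must be recomputed on every step.
    grad = [0.0] * len(w)
    for x_n, t_n in zip(X, t):
        err = output(w, x_n) - t_n                    # (y_n - t_n)
        grad = [g + err * xi for g, xi in zip(grad, x_n)]
    return [wi - eta * g for wi, g in zip(w, grad)]

X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]  # first component: bias input
t = [1, 0, 1]
w = batch_update([0.0, 0.0], X, t)
print(w)  # one step moves w to [-0.1, 0.1] for this data
```

Only the misclassified second point contributes to the sum here, since (y_n - t_n) is zero for correctly classified points.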
How do you reduce the amount of computation for gradient descent?
Using stochastic gradient descent
What is stochastic gradient descent?
Instead of minimising the total error, we randomly select points and minimise the average error of those points
What is the equation for calculating the average error?
E(X) = 1/N ∑ w^T x_n (y_n - t_n)
What is the equation for stochastic gradient descent?
w_(k+1) = w_k - η x_n (y_n - t_n), where n is a single randomly selected point
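A minimal sketch of this stochastic update, assuming a tiny made-up dataset and a step-function perceptron output:

```python
import random

# Stochastic gradient descent for the perceptron: each step picks one
# point n at random and applies w <- w - eta * x_n * (y_n - t_n).
# The two-point dataset (first component is a bias input fixed at 1)
# is a made-up illustration.

def output(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def sgd(w, X, t, eta=0.1, steps=100):
    for _ in range(steps):
        n = random.randrange(len(X))     # one random point per step
        err = output(w, X[n]) - t[n]     # (y_n - t_n)
        w = [wi - eta * err * xi for wi, xi in zip(w, X[n])]
    return w

random.seed(1)
X = [[1.0, 0.0], [1.0, 1.0]]
t = [0, 1]
w = sgd([0.0, 0.0], X, t)
print([output(w, x) for x in X])  # matches t = [0, 1] after training
```

Each step costs only one point's worth of computation, but a single update can temporarily move w the wrong way for the other points, which is the trade-off noted below.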
What is the benefit of stochastic gradient descent? What is the trade off? What will happen overall?
Benefit: It uses much less computational power
Trade off: Sometimes the error can increase
Overall: The error will trend downward