Perceptron Optimisation and Gradient Descent Flashcards
Explain local optimisation
If you graph the error for all values of a single weight, the gradient at the current value tells you which direction to move that weight to reduce the error.
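A minimal sketch of the idea, assuming a toy one-weight error curve E(w) = (w - 3)²:

```python
# Local optimisation on a single weight: the sign of the gradient
# tells us which direction reduces the error. The error function
# E(w) = (w - 3)^2 is an assumed toy, minimised at w = 3.

def error(w):
    return (w - 3) ** 2

def gradient(w):
    # dE/dw = 2(w - 3)
    return 2 * (w - 3)

w = 5.0
g = gradient(w)                  # +4.0: error increases as w increases...
direction = -1 if g > 0 else 1   # ...so move w the opposite way
print(f"gradient at w={w} is {g}, so step in direction {direction}")
```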
Explain zero order gradient descent method for optimization
This method takes a step towards the goal using only the direction of descent, not the gradient itself.
You can tell whether the error goes up or down (e.g. by comparing error values at nearby points), but not by how much, so you take a small fixed-size step downhill.
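A zero-order sketch under the same assumed toy error curve; note that only error values are compared and every step has the same fixed size:

```python
# Zero-order descent: probe both directions, keep whichever function
# value is lower. No gradient is ever computed. The error function
# and step size are assumed toys.

def error(w):
    return (w - 3) ** 2

w, step = 5.0, 0.1
for _ in range(100):
    if error(w + step) < error(w):
        w += step
    elif error(w - step) < error(w):
        w -= step
    else:
        break  # neither direction improves: stop
print(w)  # ≈ 3.0, reached in many small fixed-size steps
```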
Explain first order gradient descent method for optimisation
You know the gradient of the slope, so you take a step in the direction of steepest descent towards your goal.
How useful is zero order gradient descent? Why?
> Not very effective
> Slow: with no magnitude information every step is the same fixed size, so it takes many steps to converge
What is the equation for first order gradient descent?
x_(t+1) = x_t - η∇f(x_t)
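A single update under this rule, sketched for an assumed quadratic f(x) = ||x||² (so ∇f(x) = 2x):

```python
import numpy as np

# One first-order update x_(t+1) = x_t - η∇f(x_t).
# The quadratic f and all values are assumed toys.

def grad_f(x):
    return 2 * x               # ∇f(x) for f(x) = ||x||^2

eta = 0.1                      # learning rate η
x = np.array([4.0, -2.0])      # x_t
x = x - eta * grad_f(x)        # x_(t+1)
print(x)                       # [3.2, -1.6]: one step towards the minimum at 0
```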
Explain the usefulness and benefits of second order gradient descent
> Extremely computationally intensive when you have a lot of variables, since it uses second derivatives (the Hessian)
The benefit is that it takes a single (ideal) step in the right direction; on a quadratic error surface it lands exactly on the minimum.
In practice it is often easier to just take lots of small first-order steps.
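A sketch of one Newton (second-order) step, x_(t+1) = x_t - H⁻¹∇f(x_t), on an assumed quadratic where the step is exact:

```python
import numpy as np

# Newton step on the assumed quadratic f(x) = x^T A x, whose gradient
# is 2Ax and whose Hessian H = 2A is constant. Solving with H is what
# makes second-order methods expensive as the number of variables grows.

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])

def grad_f(x):
    return 2 * A @ x                        # ∇f(x) = 2Ax

H = 2 * A                                   # Hessian of the quadratic
x = np.array([4.0, -2.0])
x = x - np.linalg.solve(H, grad_f(x))       # single Newton step
print(x)                                    # [0, 0]: the minimum, in one step
```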
When do you stop with gradient descent?
When the change between iterations becomes negligible, e.g. when the gradient magnitude (or the change in error between steps) falls below a threshold.
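A sketch of one common variant of this rule, stopping when the gradient magnitude drops below an assumed tolerance:

```python
import numpy as np

# Iterate until the gradient is (nearly) zero. The function
# f(x) = ||x||^2 and the constants are assumed toys.

def grad_f(x):
    return 2 * x

x, eta, threshold = np.array([4.0, -2.0]), 0.1, 1e-6
while np.linalg.norm(grad_f(x)) > threshold:
    x = x - eta * grad_f(x)
print(x)  # very close to the minimum at [0, 0]
```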
What is the learning rate?
This is the step size: the learning rate η scales the gradient to set how far each update moves.
What is the effect of the size of the learning rate?
If the step is too large you can travel too far and overshoot the minimum error.
This can cause divergence: the updates oscillate and move further from the minimum with each step. (Too small a step, conversely, makes convergence very slow.)
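A sketch of this on the assumed toy function f(w) = w² (∇f = 2w), where any η > 1 overshoots by more than it corrects:

```python
# Compare a small learning rate against one that is too large.
# For f(w) = w^2 each update is w ← w(1 - 2η), so |1 - 2η| > 1
# (i.e. η > 1) makes the iterates grow without bound.

def grad(w):
    return 2 * w

for eta in (0.1, 1.1):          # small vs. too-large learning rate
    w = 1.0
    for _ in range(10):
        w = w - eta * grad(w)
    print(f"eta={eta}: w after 10 steps = {w:.4f}")
# eta=0.1 converges towards 0; eta=1.1 diverges
```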
What is the effect of local minima?
The error surface can have multiple minima, each with a different error value. Gradient descent does not take this into account, so it will converge on a local minimum, which may not be the smallest (global) minimum.
What can we do to prevent local minima?
We can restart learning several times with random initial weights; each run may converge on a different minimum, and we keep the best one found.
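A random-restart sketch on an assumed one-weight function with two minima:

```python
import numpy as np

# f(w) = w^4 - 3w^2 + w has a local minimum near w ≈ 1.1 and the
# global minimum near w ≈ -1.3. Restarting descent from random
# initial weights lets us keep the best minimum found.

def f(w):
    return w**4 - 3*w**2 + w

def grad(w):
    return 4*w**3 - 6*w + 1

rng = np.random.default_rng(0)
best_w, best_f = None, np.inf
for _ in range(5):                     # several restarts
    w = rng.uniform(-2, 2)             # random initial weight
    for _ in range(500):
        w -= 0.01 * grad(w)            # plain first-order descent
    if f(w) < best_f:                  # keep the lowest minimum seen
        best_w, best_f = w, f(w)
print(best_w, best_f)
```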
What is the issue with calculating gradient using E(X) = ∑ | y_n - t_n | ?
The equation E(X) = ∑ |y_n - t_n| counts misclassified points, so it is piecewise constant: moving the decision boundary slightly may bring points closer to being correctly classified without changing the count, so the classifier appears not to improve. Because the function is piecewise constant it is not differentiable, so it gives no gradient to descend.
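A sketch of the problem on an assumed toy 1-D dataset: moving the boundary closer to the correct position leaves the error count unchanged:

```python
import numpy as np

# Count misclassifications E(X) = ∑ |y_n - t_n| for a thresholded
# output y_n = step(w·x + w0). The data and weights are assumed toys.

X = np.array([-2.0, -1.0, 1.0, 2.0])      # 1-D inputs
t = np.array([0, 0, 1, 1])                # targets

def count_errors(w, w0):
    y = (w * X + w0 > 0).astype(int)      # threshold output
    return np.abs(y - t).sum()

print(count_errors(1.0, 1.5))   # 1 point misclassified
print(count_errors(1.0, 1.4))   # still 1: the count is flat, no gradient
```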
How do we measure error where the value is continuous even if the number of misclassified points does not change?
By measuring how far each misclassified point is from the decision boundary; this distance changes continuously as the boundary moves, even when the misclassification count does not.
What is the equation for calculating the distance of a misclassified point to the decision boundary?
> x = x_p + d × w/||w||
> d = w^T(x - x_p) / ||w||
> d = distance to the hyperplane
> w = vector perpendicular to the hyperplane
> x_p = a point on the hyperplane
> x = the misclassified point
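A sketch of the formula with assumed toy values; note that for any x_p on the hyperplane, w^T(x - x_p) equals w^T x + w_0:

```python
import numpy as np

# Distance from a point to the hyperplane w^T x + w0 = 0.
# Since w^T x_p = -w0 for any x_p on the plane, the formula
# d = w^T(x - x_p) / ||w|| simplifies to (w^T x + w0) / ||w||.
# All values are assumed toys.

w = np.array([3.0, 4.0])       # normal to the hyperplane, ||w|| = 5
w0 = -5.0                      # bias term
x = np.array([3.0, 4.0])       # a (misclassified) point

d = (w @ x + w0) / np.linalg.norm(w)
print(d)                       # (9 + 16 - 5) / 5 = 4.0
```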
What is the new equation to calculate the total distance of all misclassified points?
E(X) = ∑_{n∈M} |w^T x_n + w_0|, where M is the set of misclassified points. Each term is proportional to the point's distance from the boundary (the exact distance would divide by ||w||, but that scaling is usually dropped).
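A sketch of this error with assumed toy data and weights:

```python
import numpy as np

# Misclassified-distance error E(X) = ∑_{n∈M} |w^T x_n + w_0|.
# The data, targets (t_n ∈ {-1, +1}) and weights are assumed toys.

X = np.array([[2.0, 1.0],
              [1.0, 0.5],
              [-1.0, -2.0]])
t = np.array([1, -1, -1])
w, w0 = np.array([1.0, 1.0]), -1.0

a = X @ w + w0                    # activations w^T x_n + w_0
misclassified = np.sign(a) != t   # points on the wrong side of the boundary
E = np.abs(a[misclassified]).sum()
print(E)                          # 0.5: total (unnormalised) distance of errors
```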