Week 8 Flashcards
Optimization
What are the 3 pillars of machine learning?
- Linear Algebra
- Probability
- Optimization
What does the linear algebra pillar cover?
Structure / relationships in the data
What does the probability pillar cover?
Noise / uncertainty
What does the optimization pillar cover?
Finding the best fit
What is the general form of an optimization problem?
We have a variable/parameter x ∈ R^d for which we want the value that minimizes an objective function f(x), subject to constraints:
- gi(x) ≤ 0 for all i = 1, …, k (the inequality constraints)
- hj(x) = 0 for all j = 1, …, L (the equality constraints)
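A minimal sketch of this general form in Python using scipy.optimize.minimize (assuming SciPy is available); the specific f, g, and h below are made-up illustrations, not from the flashcards:

```python
# General constrained form: minimize f(x) subject to g(x) <= 0 and h(x) = 0.
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 1)**2 + (x[1] - 2)**2   # objective f(x)
g = lambda x: x[0] + x[1] - 2                 # inequality constraint g(x) <= 0
h = lambda x: x[0] - x[1]                     # equality constraint   h(x) = 0

# scipy expects 'ineq' constraints as fun(x) >= 0, so we pass -g
res = minimize(f, x0=np.zeros(2), method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': lambda x: -g(x)},
                            {'type': 'eq',   'fun': h}])
print(res.x)  # approximately [1, 1]
```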
How do we solve an unconstrained optimization problem?
Start at an arbitrary point x0 and repeatedly update it (gradient descent), choosing a step size that lets the iterates converge, e.g. a decreasing sequence such as 1/t, 1/(t+1), 1/(t+2), …
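A minimal 1D sketch of this recipe (the objective x^2 and the 1/t step-size schedule are illustrative assumptions):

```python
# 1D gradient descent: arbitrary start x0, decreasing step size 1/t.
def grad_descent(f_prime, x0, num_steps=100):
    x = x0
    for t in range(1, num_steps + 1):
        eta = 1.0 / t              # decreasing step size: 1, 1/2, 1/3, ...
        x = x - eta * f_prime(x)   # step against the derivative
    return x

print(grad_descent(f_prime=lambda x: 2 * x, x0=5.0))  # near 0, the minimum of x**2
```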
Where does the gradient descent algorithm converge to?
A local minimum (NOT necessarily the global minimum). In most optimization problems, a local minimum is sufficient and good enough.
What are convex functions?
Functions for which any local minimum is also the global minimum
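For reference, the standard convexity condition (not stated on the card but consistent with it): for all points x, y and all λ ∈ [0, 1],

```latex
f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y)
```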
What is a Taylor series? what does it say?
A Taylor series lets us find the value of a function at a point (x + d), given f(x) and all the derivatives of f at x.
That is, if we have complete information about the derivatives at one point, we can work out the value of the function at any other point.
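Written out (standard expansion, in the same notation as the cards below):

```latex
f(x + d) = f(x) + d\,f'(x) + \frac{d^2}{2!}\,f''(x) + \frac{d^3}{3!}\,f'''(x) + \cdots
```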
Why is Taylor’s series useful in optimization?
If η (eta) is small, we can ignore the higher-order terms (second derivative and upwards) and approximate f(x + ηd) ≈ f(x) + ηd f'(x).
f(x + ηd) − f(x) ≈ ηd f'(x)
For the function value to decrease, f(x + ηd) − f(x) should be < 0, which means ηd f'(x) must be negative. Since η > 0 is fixed and f'(x) is whatever the slope is at x, we need to choose the direction d so that the whole term is negative. Selecting d = −f'(x) (the negative of the slope) gives ηd f'(x) = −η (f'(x))², which is negative whenever f'(x) ≠ 0, so moving in that direction is sure to reduce the function value.
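A quick numerical check of this argument (the function x^2, the point x = 3, and η = 0.01 are illustrative assumptions):

```python
# Moving a small step eta in the direction d = -f'(x) decreases f.
f = lambda x: x**2
f_prime = lambda x: 2 * x

x, eta = 3.0, 0.01
d = -f_prime(x)                 # descent direction: negative of the slope
print(f(x + eta * d) - f(x))    # negative, approx. eta * d * f'(x) = -0.36
```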
How do we use gradient descent for multivariate functions?
For a multivariate function, the derivative becomes the gradient (the vector of partial derivatives).
If we move in the direction opposite to the gradient, we move in the direction of steepest descent and decrease the value of the function the fastest.
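A minimal multivariate sketch (the quadratic objective and the fixed step size 0.1 are illustrative assumptions):

```python
# Multivariate gradient descent: step opposite the gradient vector.
import numpy as np

def f(x):
    return (x[0] - 1)**2 + 3 * (x[1] + 2)**2

def grad_f(x):
    return np.array([2 * (x[0] - 1), 6 * (x[1] + 2)])  # vector of partial derivatives

x = np.zeros(2)
eta = 0.1
for _ in range(200):
    x = x - eta * grad_f(x)   # move in the direction of steepest descent

print(x)  # approximately [1, -2], the minimizer
```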
Is the direction of gradient descent always the right direction to choose?
For unconstrained optimization - YES
For constrained optimization - NO. The direction of steepest descent may be ruled out by the constraints (following it could leave the feasible region).
What is Newton's method for optimization?
It has no step-size parameter (no η to choose).
The update rule for Newton's method is
xn+1 = xn − f'(xn) / f''(xn)
For higher-dimensional functions, it requires computing the inverse of the Hessian matrix.
Newton's method does not work if the Hessian is not invertible.
Newton's method may not converge: for some functions and some starting points, it can enter an infinite cycle.
It may converge to a saddle point instead of a local minimum.
It takes more time per iteration and is more computation- and memory-intensive.
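A minimal 1D sketch of the update rule above (the function x^4 and the starting point are illustrative assumptions; note there is no step size η):

```python
# Newton's method in 1D: x_{n+1} = x_n - f'(x_n) / f''(x_n).
f_prime = lambda x: 4 * x**3
f_double_prime = lambda x: 12 * x**2

x = 2.0
for _ in range(30):
    x = x - f_prime(x) / f_double_prime(x)   # fails if f''(x) = 0 (non-invertible "Hessian")

print(x)  # approaches 0, the minimizer of x**4
```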