we are so close, let's go. Flashcards:
norm stuff:
for a norm ||.||: ||x||>=0, and ||x||=0 iff x=0
||λx|| = |λ|||x||
||x+y||<=||x||+||y||
||x||1 -> sum of |xi|s
||x||2 -> root of sum of squares of xi (square is in the sum, root is outside)
||x||∞ -> max of |xi|s
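quick sanity check of the three norms in numpy (a minimal sketch, the example vector is made up):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])   # made-up example vector

# 1-norm: sum of |xi|
print(np.sum(np.abs(x)), np.linalg.norm(x, 1))       # 8.0  8.0
# 2-norm: root of sum of squares
print(np.sqrt(np.sum(x**2)), np.linalg.norm(x, 2))   # ~5.1  ~5.1
# infinity-norm: max of |xi|
print(np.max(np.abs(x)), np.linalg.norm(x, np.inf))  # 4.0  4.0
```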
gradient descent:
x(k+1)=xk+akpk, pk=-f’(xk)
any norms in here refer to 2-norms (||x||2)
if it asks for an improvement: stochastic gradient descent, because you don’t compute every gradient term, just a random subsample
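minimal gradient descent sketch (assumes a fixed step size a and a hand-written gradient; the quadratic objective is just made up for illustration):

```python
import numpy as np

def grad_descent(grad, x0, a=0.1, iters=100):
    # x(k+1) = xk + ak*pk with pk = -f'(xk) and a fixed step ak = a
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - a * grad(x)
    return x

# example: f(x) = ||x||2^2, so f'(x) = 2x and the minimiser is 0
grad = lambda x: 2 * x
print(grad_descent(grad, [5.0, -3.0]))   # close to [0, 0]
```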
descent direction:
pk is a descent direction at xk if ⟨pk,f’(xk)⟩<0
optimal step a for descent algo:
the a>0 that minimises f(xk+apk)
we don’t often use this; instead we require some sufficient decrease condition
armijo goldstein conditions:
f(xk+apk)<=f(xk)+ca⟨f’(xk),pk⟩, c in (0,1/2)
f(xk+apk)>=f(xk)+(1-c)a⟨f’(xk),pk⟩
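backtracking line search sketch that enforces the first (sufficient decrease) condition; c=1e-4 and the halving factor are made-up typical choices, and checking the second condition would be analogous:

```python
import numpy as np

def backtracking(f, grad, x, p, a0=1.0, c=1e-4, shrink=0.5):
    # shrink a until f(x + a*p) <= f(x) + c*a*<f'(x), p>
    a = a0
    slope = np.dot(grad(x), p)   # < 0 if p is a descent direction
    while f(x + a * p) > f(x) + c * a * slope:
        a *= shrink
    return a

# example: f(x) = ||x||2^2 with the steepest descent direction p = -f'(x)
f = lambda x: np.dot(x, x)
grad = lambda x: 2 * x
x = np.array([2.0, 1.0])
print(backtracking(f, grad, x, -grad(x)))
```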
convergence:
linear -> ||x(k+1)-x*||<=r||xk-x*|| for some r in (0,1)
superlinear -> ||x(k+1)-x*||/||xk-x*|| -> 0 as k approaches infinity
order p -> ||x(k+1)-x*||<=M||xk-x*||^p for some M>0, when p=2 it’s quadratic
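small check of the definitions on made-up error sequences ek = ||xk - x*||: the ratio e(k+1)/ek settles at some r<1 for linear convergence, and e(k+1)/ek^2 settles at a constant for quadratic:

```python
# made-up error sequences ek = ||xk - x*||
lin = [0.5**k for k in range(1, 10)]         # linear: ratios -> 0.5
quad = [10.0**-(2**k) for k in range(1, 5)]  # quadratic: e(k+1) ~ e(k)^2

print([lin[k+1] / lin[k] for k in range(len(lin) - 1)])         # ~0.5 each
print([quad[k+1] / quad[k]**2 for k in range(len(quad) - 1)])   # ~1.0 each
```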
newton’s method:
x(k+1)=xk-f’’(xk)^(-1)f’(xk)
can converge to a max or saddle point as well as a min; to check for a min we want f’’(x*)>0 (positive definite)
downside: computing Hessians is expensive
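Newton sketch (assumes the gradient and Hessian are given; solves the linear system instead of forming the inverse, and the example objective is made up):

```python
import numpy as np

def newton(grad, hess, x0, iters=20):
    # x(k+1) = xk - f''(xk)^(-1) f'(xk), done as a linear solve
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# example: f(x) = x1^4 + x2^2, minimum at the origin
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
print(newton(grad, hess, [1.0, 1.0]))   # close to [0, 0]
```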
supervised learning:
goal is to learn a function h:X->Y, where X is the input space and Y is the output space (both written calligraphic), given a set of input-output pairs (xi,yi)
pose it as: minimise over h in F the average loss (1/N)sum(i=1->N) l(h(xi),yi), given a class of functions F (cause h in F) and a loss function l:YxY->R+
some uses: handwriting recognition, linear regression, text classification
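minimal instance of the setup, assuming F = linear functions h(x)=⟨w,x⟩ and squared loss, so minimising the average loss is just least squares (data is randomly generated / made up):

```python
import numpy as np

# made-up data: yi = <w_true, xi> + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# F = {h(x) = <w, x>}, loss l(h(x), y) = (h(x) - y)^2
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimises the average squared loss
print(w_hat)                                    # close to w_true
print(np.mean((X @ w_hat - y)**2))              # the minimised empirical risk
```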
set is convex when:
λx+(1-λ)y in C for all x,y in C and all λ in [0,1]
if C, D convex, C intersection D and C+D are convex, and AC is convex where A is some real matrix
function is convex when:
f(λx+(1-λ)y)<=λf(x)+(1-λ)f(y) for all x,y and all λ in [0,1]
or equivalently (for twice differentiable f): f’’(x) is positive semidefinite for all x (v^(T)f’’(x)v>=0 for all v)
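quick numerical check of the second-order condition: for f(x)=x^(T)Bx with symmetric B the Hessian is 2B everywhere, so f is convex iff 2B has no negative eigenvalues (sketch, matrix made up):

```python
import numpy as np

B = np.array([[2.0, 1.0],
              [1.0, 3.0]])   # made-up symmetric matrix

eigvals = np.linalg.eigvalsh(2 * B)   # Hessian of f(x) = x^T B x
print(eigvals, "convex" if np.all(eigvals >= 0) else "not convex")
```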
linear programming format:
maximise ⟨c,x⟩ subject to Ax<=b
can instead write it as subject to Ax=b, x>=0 (standard form: add slack variables for the inequalities and split free x into positive and negative parts)
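the inequality form can go straight into a solver; a sketch with scipy.optimize.linprog, which minimises, so pass -c to maximise (numbers made up, x left free via bounds=(None, None)):

```python
import numpy as np
from scipy.optimize import linprog

# maximise <c, x> subject to Ax <= b (x free)
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
b = np.array([4.0, 3.0, 3.0])

res = linprog(-c, A_ub=A, b_ub=b, bounds=(None, None))
print(res.x, -res.fun)   # optimal x and the optimal <c, x> (here (1, 3) and 7)
```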
feasible set:
set of all points satisfying the constraints of a lin prog or convex opt problem
farkas’ lemma:
there exists x such that Ax=b, x>=0 iff there does Not exist y such that A^(T)y>=0, ⟨y,b⟩<0
also works for Ax<=b w/ A^(T)y=0, y>=0, and ⟨y,b⟩<0
linear program dual:
primal is max ⟨c,x⟩ subject to Ax<=b
dual is min ⟨b,y⟩ subject to A^(T)y=c, y>=0
the optimal values (the ⟨c,x⟩ and ⟨b,y⟩ at the optima) are the same, i.e. strong duality holds for LPs (whenever the primal has a finite optimum)
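continuing the made-up numbers from the sketch above, the dual solves to the same optimal value:

```python
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
b = np.array([4.0, 3.0, 3.0])

# primal: max <c,x> s.t. Ax <= b (linprog minimises, so pass -c)
primal = linprog(-c, A_ub=A, b_ub=b, bounds=(None, None))
# dual:   min <b,y> s.t. A^T y = c, y >= 0
dual = linprog(b, A_eq=A.T, b_eq=c, bounds=(0, None))

print(-primal.fun, dual.fun)   # same optimal value (7 for these numbers)
```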
lagrange dual:
φ(λ,μ)=inf over x of L(x,λ,μ), where L(x,λ,μ)=f(x)+⟨λ,g(x)⟩+⟨μ,h(x)⟩, f is the thing being minimised, g is the <=0 thing, and h is the =0 thing; the dual problem is maximise φ(λ,μ) subject to λ>=0
strong duality:
the primal and dual optimal values are the same (for convex problems this holds under conditions like Slater’s: some strictly feasible point exists)
KKT conditions:
g(x)<=0
h(x)=0
λ>=0
λigi(x)=0
f’(x)+sum(λigi’(x))+sum(μihi’(x))=0
if these are satisfied (for a convex problem) there’s 0 duality gap (so strong duality)
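tiny worked check of the KKT conditions on a made-up problem: minimise f(x)=x^2 subject to g(x)=1-x<=0 (no equality constraints), candidate point x=1 with multiplier λ=2:

```python
# problem: min f(x) = x^2 s.t. g(x) = 1 - x <= 0, candidate x = 1, lam = 2
x, lam = 1.0, 2.0

g = 1 - x                    # primal feasibility: g(x) <= 0
stationarity = 2 * x - lam   # f'(x) + lam * g'(x) = 2x + lam*(-1)

print("g(x) <= 0:        ", g <= 0)
print("lam >= 0:         ", lam >= 0)
print("lam * g(x) = 0:   ", lam * g == 0)
print("stationarity = 0: ", stationarity == 0)
```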
central path:
the set of solutions when the optimality conditions switch to XSe=τe and x,s>0 (instead of =0 and >=0 respectively); as τ->0 any solution of this approaches a solution of the original conditions
a set of solutions x(t) for t>0 that solve minimise tf(x)+Φ(x) subject to Ax=b, where Φ is the log-barrier Φ(x)=-sum(log(-gi(x)))
a point x is equal to the point x(t) on the central path iff there exist dual multipliers λ, μ such that -
g(x)<=0
Ax=b
λ>=0
-λigi(x)=1/t
f’(x)+sum(λigi’(x))+A^(T)μ=0
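1-D sketch of the central path: take f(x)=x with constraints 0<=x<=1 (optimum x*=0), use the log-barrier Φ(x)=-log(x)-log(1-x), and minimise tf(x)+Φ(x) for growing t; x(t) slides towards 0 (problem made up, uses scipy’s bounded scalar minimiser):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# f(x) = x, constraints g1(x) = -x <= 0 and g2(x) = x - 1 <= 0, optimum x* = 0
def barrier_objective(x, t):
    return t * x - np.log(x) - np.log(1 - x)   # t*f(x) + Phi(x)

for t in [1, 10, 100, 1000]:
    res = minimize_scalar(barrier_objective, args=(t,),
                          bounds=(1e-9, 1 - 1e-9), method="bounded")
    print(t, res.x)   # x(t) -> 0 as t grows
```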
optimality conditions:
using min ⟨c,x⟩ subject to Ax=b, x>=0 as example, conditions are -
Ax-b=0
A^(T)y+s-c=0 (s is a slack variable cause normally this’d be a <= not a =)
XSe=0, where X=diag(x), S=diag(s), e=(1,…,1), i.e. xisi=0 for each i
x>=0
s>=0
the conditions involving s hold componentwise, for each of s1,…,sn
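tiny hand-worked instance to check the conditions against (everything here is made up and verified by hand: x is primal optimal, y the dual variable, s the dual slack):

```python
import numpy as np

# min <c,x> s.t. Ax = b, x >= 0
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

x = np.array([1.0, 0.0])   # primal optimum
y = np.array([1.0])        # dual variable
s = np.array([0.0, 1.0])   # dual slack

print(A @ x - b)                                # Ax - b = 0
print(A.T @ y + s - c)                          # A^T y + s - c = 0
print(np.diag(x) @ np.diag(s) @ np.ones(2))     # XSe = 0
print(np.all(x >= 0), np.all(s >= 0))           # x >= 0, s >= 0
```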
derivatives by vectors:
x^(T)B -> B
x^(T)b -> b - same as b^(T)x
x^(T)x -> 2x
x^(T)Bx -> (B+B^(T))x, which is 2Bx when B is symmetric
basically drop the x^(T), and if there’s also an x left, replace it with a 2 (assuming B is symmetric)
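finite-difference check of the quadratic-form rule with a made-up non-symmetric B, showing the gradient is (B+B^(T))x in general and 2Bx only when B is symmetric:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 3))   # not symmetric in general
x = rng.normal(size=3)

f = lambda x: x @ B @ x
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])

print(numeric)          # numerical gradient
print((B + B.T) @ x)    # matches
print(2 * B @ x)        # only matches when B is symmetric
```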