we are so close let's go Flashcards
norm stuff:
||.|| is a norm iff: ||x||>=0, and ||x||=0 iff x=0
||λx|| = |λ|||x||
||x+y||<=||x||+||y||
||x||1 -> sum of |xi|s
||x||2 -> root of sum of squares of xi (square is in the sum, root is outside)
||x||∞ -> max of |xi|s
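quick sanity check in numpy (my own example vector, not from the notes):
import numpy as np

x = np.array([3.0, -4.0, 1.0])
print(np.sum(np.abs(x)))        # 1-norm: 8.0
print(np.sqrt(np.sum(x**2)))    # 2-norm: sqrt(26) ~ 5.10
print(np.max(np.abs(x)))        # infinity-norm: 4.0
# same answers via np.linalg.norm(x, 1), np.linalg.norm(x, 2), np.linalg.norm(x, np.inf)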
gradient descent:
x(k+1)=xk+akpk, pk=-f’(xk)
any norms in here refer to 2-norms (||x||2)
if it asks for an improvement: stochastic gradient descent, since it uses a gradient estimate from a random sample/minibatch instead of computing the full gradient every step
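a minimal sketch of the update (the toy quadratic and fixed step size are my own choices):
import numpy as np

def grad_descent(grad, x0, a=0.1, iters=100):
    # x(k+1) = xk + a*pk with pk = -grad(xk)
    x = x0
    for _ in range(iters):
        x = x - a * grad(x)
    return x

# f(x) = ||x||2^2 has gradient 2x and minimiser 0
print(grad_descent(lambda x: 2 * x, np.array([5.0, -3.0])))   # ~ [0, 0]
SGD would swap grad(x) for an estimate computed from a random sample/minibatch of the data.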
descent direction:
pk is a descent direction at xk if ⟨pk,f'(xk)⟩<0
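tiny check with made-up numbers:
import numpy as np

g = np.array([2.0, -1.0])   # f'(xk)
p = -g                      # steepest descent choice
print(np.dot(p, g) < 0)     # True -> p is a descent direction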
optimal step a for descent algo:
the a that minimises f(xk+apk) (exact line search)
we don't often use this; in practice we just require a sufficient decrease condition instead
armijo goldstein conditions:
f(xk+apk)<=f(xk)+ca⟨f'(xk),pk⟩ (sufficient decrease), with c in (0,1/2)
f(xk+apk)>=f(xk)+(1-c)a⟨f'(xk),pk⟩ (stops a from being too small)
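a backtracking line search sketch that enforces only the first (sufficient decrease) condition; the constants and the toy function are my own choices:
import numpy as np

def backtracking(f, grad, x, p, a=1.0, c=1e-4, shrink=0.5):
    # shrink a until f(x + a*p) <= f(x) + c*a*<f'(x), p>
    while f(x + a * p) > f(x) + c * a * np.dot(grad(x), p):
        a *= shrink
    return a

f = lambda x: np.sum(x**2)
grad = lambda x: 2 * x
x = np.array([5.0, -3.0])
print(backtracking(f, grad, x, -grad(x)))   # 0.5 here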
convergence:
linear -> ||x(k+1)-x||<=r||xk-x|| for some r in (0,1), where x is the limit point
superlinear -> as k approaches infinity, ||x(k+1)-x||/||xk-x||=0
order p -> ||x(k+1)-x||<=M||xk-x||^p for some M>0 and p>1; when p=2 it's quadratic
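toy illustration (made-up error sequences): errors shrinking linearly with r=0.5 vs quadratically (p=2):
e_lin, e_quad = 0.1, 0.1
for k in range(5):
    e_lin, e_quad = 0.5 * e_lin, e_quad**2
    print(k + 1, e_lin, e_quad)   # the quadratic errors collapse far faster (1e-2, 1e-4, 1e-8, ...)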
newton’s method:
x(k+1)=xk-f''(xk)^(-1)f'(xk)
can converge to max or saddle points as well as min; to check for a min we want f''(x*) positive definite (f''(x*)>0)
downside: computing (and solving with) Hessians is expensive
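a minimal sketch (the toy function is my own choice), solving the linear system instead of forming the inverse:
import numpy as np

def newton(grad, hess, x0, iters=20):
    # x(k+1) = xk - f''(xk)^(-1) f'(xk)
    x = x0
    for _ in range(iters):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# f(x) = x1^4 + x2^2
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
print(newton(grad, hess, np.array([2.0, 1.0])))   # heading to the minimiser [0, 0]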
supervised learning:
goal is to learn a function h:X->Y, where X is the input space and Y is the output space (both written calligraphic), given a set of input-output pairs (xi,yi)
pose it as: minimise (1/N)sum(i=1->N) l(h(xi),yi) over h in F, given a class of functions F and a loss function l:YxY->R+ (see the sketch after this card)
some uses, handwriting recognition, linear regression, text classification
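a small sketch with F = linear functions h(x)=⟨w,x⟩ and squared loss, solved via least squares (the data are made up):
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # first column = bias feature
y = np.array([1.0, 3.0, 5.0, 7.0])                               # made-up targets

# minimising (1/N) sum (h(xi) - yi)^2 over linear h is ordinary least squares
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)   # ~ [1, 2], i.e. h(x) = 1 + 2x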
set is convex when:
λx+(1-λ)y in C for all x,y in C and all λ in [0,1]
if C, D convex: C∩D and C+D = {x+y : x in C, y in D} are convex, and AC = {Ax : x in C} is convex for any real matrix A
function is convex when:
f(λx+(1-λ)y)<=λf(x)+(1-λ)f(y) for all x,y and all λ in [0,1] (domain must be convex)
equivalently (for twice-differentiable f), f''(x) is positive semidefinite for all x (v^(T)f''(x)v>=0 for all v)
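quick numeric check (my own Q): f(x) = x^T Q x has Hessian 2Q, so convexity reduces to Q being PSD:
import numpy as np

Q = np.array([[2.0, 1.0], [1.0, 2.0]])      # eigenvalues 1 and 3
H = 2 * Q                                   # Hessian of f(x) = x^T Q x
print(np.all(np.linalg.eigvalsh(H) >= 0))   # True -> PSD -> f is convex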
linear programming format:
maximise ⟨c,x⟩ subject to Ax<=b
can equivalently write it as: subject to Ax=b, x>=0 (standard form, via slack variables)
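sketch using scipy (my own numbers; note linprog minimises, so negate c, and its default bounds force x>=0, so relax them for the plain Ax<=b form):
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0])                                   # maximise <c, x>
A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([4.0, 3.0, 3.0])                              # subject to Ax <= b

res = linprog(-c, A_ub=A, b_ub=b, bounds=(None, None))
print(res.x, -res.fun)   # optimum x ~ [1, 3], value 7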
feasible set:
the set of all points satisfying the constraints of a lin prog or convex opt problem
farkas’ lemma:
there exists x such that Ax=b, x>=0 iff there does Not exist y such that A^(T)y>=0, ⟨y,b⟩<0
also works for Ax<=b (no x>=0): that system has a solution iff there's no y with A^(T)y=0, y>=0, and ⟨y,b⟩<0
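tiny made-up example of the alternative: with A = I and b = (1, -1), Ax=b forces x2=-1<0, so no x>=0 works, and Farkas promises a certificate y:
import numpy as np

A = np.eye(2)
b = np.array([1.0, -1.0])
y = np.array([0.0, 1.0])
print(A.T @ y, np.dot(y, b))   # A^T y = [0, 1] >= 0 and <y, b> = -1 < 0, as the lemma says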
linear program dual:
primal is max ⟨c,x⟩ subject to Ax<=b
dual is min ⟨b,y⟩ subject to A^(T)y=c, y>=0
the optimal values (the two inner products at the optima) are equal (strong LP duality)
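continuing the made-up LP from the linear programming card above, solving both sides with scipy to watch the values agree:
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([4.0, 3.0, 3.0])

primal = linprog(-c, A_ub=A, b_ub=b, bounds=(None, None))   # max <c, x> s.t. Ax <= b
dual = linprog(b, A_eq=A.T, b_eq=c, bounds=(0, None))       # min <b, y> s.t. A^T y = c, y >= 0
print(-primal.fun, dual.fun)                                # both 7.0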
lagrange dual:
φ(λ,μ)=inf over x of L(x,λ,μ), where L(x,λ,μ)=f(x)+λg(x)+μh(x), f the objective being minimised, g the <=0 constraint, h the =0 constraint; the dual problem is to maximise φ over λ>=0
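worked example with made-up numbers: minimise f(x)=x^2 subject to g(x)=1-x<=0. then L(x,λ)=x^2+λ(1-x), the inf over x is at x=λ/2, so φ(λ)=λ-λ^2/4. maximising φ over λ>=0 gives λ=2 and φ(2)=1, which matches the primal optimum f(1)=1.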