GD, SGD, and SVM Flashcards
Gradient Descent
General approach for minimizing a differentiable convex function f(w).
It measures the local gradient of the error function with respect to the parameter vector w and takes a step in the direction of the descending gradient.
GD algorithm
w(0) <– 0
for t <– 0 to T-1 do
- w(t+1) <– w(t) - eta * grad f(w(t))
return barw = (1/T) SUM_{t=0}^{T-1} w(t)
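A minimal NumPy sketch of this loop (names such as grad and dim are illustrative, not from the cards):

```python
import numpy as np

def gradient_descent(grad, dim, eta, T):
    """Run T GD steps from w = 0 and return the averaged iterate barw."""
    w = np.zeros(dim)
    iterates = []
    for t in range(T):
        iterates.append(w)
        w = w - eta * grad(w)          # w(t+1) <- w(t) - eta * grad f(w(t))
    return np.mean(iterates, axis=0)   # barw = (1/T) SUM_t w(t)

# Example: f(w) = ||w - 1||^2 has gradient 2(w - 1).
w_bar = gradient_descent(lambda w: 2 * (w - np.ones(3)), dim=3, eta=0.1, T=1000)
```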
Stochastic Gradient Descent
Instead of using the exact gradient, we take a (random) vector whose expected value equals the gradient direction.
SGD algorithm
w(0) <– 0
for t <– 0 to T-1 do
- choose vt at random from a distribution such that E[vt | w(t)] = grad f(w(t))
/// vt has expected value equal to the gradient of f(w(t))
- w(t+1) <– w(t) - eta * vt
return barw = (1/T) SUM_{t=0}^{T-1} w(t)
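A sketch for the common finite-sum case f(w) = (1/m) SUM_i fi(w): sampling i uniformly and taking grad fi(w) gives an unbiased gradient estimate (the interface grad_i(w, i) is an assumption for illustration):

```python
import numpy as np

def sgd(grad_i, m, dim, eta, T, seed=0):
    """SGD with vt = grad f_i(w(t)) for a uniformly random i, so E[vt | w(t)] = grad f(w(t))."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    iterates = []
    for t in range(T):
        iterates.append(w)
        i = rng.integers(m)            # sample one index uniformly at random
        w = w - eta * grad_i(w, i)     # w(t+1) <- w(t) - eta * vt
    return np.mean(iterates, axis=0)   # barw = (1/T) SUM_t w(t)
```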
Hard-SVM optimization problem
Learning rule that returns an ERM hyperplane separating the training set with the largest possible margin (only applicable to linearly separable data).
Computational problem:
argmax_{(w,b): ||w|| = 1} min_{i} |<w,xi> + b|
Hard-SVM: equivalent formulation, and quadratic formulation
Due to the separability assumption:
argmax_{(w,b): ||w|| = 1} min_{i} yi(<w,xi> + b)
Quadratic formulation:
(w0, b0) = argmin_{(w,b)} ||w||^2 subject to: for all i, yi(<w,xi> + b) >= 1
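A sketch of the quadratic formulation as a convex program; it assumes the cvxpy library is available and only succeeds if the data are linearly separable:

```python
import cvxpy as cp

def hard_svm(X, y):
    """min ||w||^2  s.t.  yi(<w,xi> + b) >= 1 for all i  (X: m x d array, y: +/-1 labels)."""
    m, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]   # yi(<w,xi> + b) >= 1
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value, b.value
```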
Definition of support vectors for Hard-SVM
Support vectors = the examples at minimum distance from the separating hyperplane defined by w0.
They are the only examples that matter for defining w0.
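Given a solution (w0, b0), the support vectors can be spotted (up to numerical tolerance) as the points whose margin equals exactly 1; a small illustrative check:

```python
import numpy as np

def support_vector_indices(X, y, w0, b0, tol=1e-6):
    margins = y * (X @ w0 + b0)            # equals 1 on support vectors, > 1 elsewhere
    return np.where(margins <= 1 + tol)[0]
```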
Soft-SVM optimization problem, hinge loss
Used when the training set is not linearly separable: it allows some points to lie inside the margin or to be misclassified, i.e. to violate the hard constraint
yi(<w,xi> + b) >= 1
Soft-SVM constraints:
- slack variables: etai
- for each i: yi(<w,xi> + b) >= 1 - etai
- etai measures how much the constraint yi(<w,xi> + b) >= 1 is violated
Soft-SVM minimizes a combination of
- norm of w (margin)
- average of etai (violation of the constraints)
Soft-SVM optimization problem
input (x1,y1)…(xm,ym), parameter lambda > 0
solve:
min_{w,b,eta} ( lambda ||w||^2 + (1/m) SUM_i etai )
subject to: for all i, yi(<w,xi> + b) >= 1 - etai and etai >= 0
output: w,b
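A sketch of this constrained problem, again assuming cvxpy; the variable slack plays the role of the etai:

```python
import cvxpy as cp

def soft_svm(X, y, lam):
    """min lambda*||w||^2 + (1/m) SUM etai  s.t.  yi(<w,xi> + b) >= 1 - etai, etai >= 0."""
    m, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    slack = cp.Variable(m, nonneg=True)                  # slack variables etai >= 0
    objective = cp.Minimize(lam * cp.sum_squares(w) + cp.sum(slack) / m)
    constraints = [cp.multiply(y, X @ w + b) >= 1 - slack]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```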
Hinge loss: the equivalent unconstrained formulation.
Lshinge(w,b) = (1/m) SUM_i lhinge((w,b),(xi,yi)), where lhinge((w,b),(x,y)) = max{0, 1 - y(<w,x> + b)}
Hinge formulation
min_{w,b} ( lambda ||w||^2 + Lshinge(w,b) )
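The unconstrained objective can be evaluated directly; a small NumPy sketch (illustrative names):

```python
import numpy as np

def hinge_objective(w, b, X, y, lam):
    """lambda*||w||^2 + (1/m) SUM max{0, 1 - yi(<w,xi> + b)}."""
    losses = np.maximum(0.0, 1.0 - y * (X @ w + b))   # per-example hinge loss
    return lam * np.dot(w, w) + losses.mean()
```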
SGD for solving soft-SVM
We want to solve
min_w ( (lambda/2) ||w||^2 + (1/m) SUM_i max{0, 1 - yi <w,xi>} )
Note: it is standard to add the factor 1/2 to the regularization term to simplify some computations.
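A sketch of SGD on this objective with a fixed step size eta (homogeneous case, no bias term, as in the formula above; a decreasing Pegasos-style step size eta_t = 1/(lambda*t) is also common):

```python
import numpy as np

def sgd_soft_svm(X, y, lam, eta, T, seed=0):
    """SGD on (lambda/2)||w||^2 + (1/m) SUM max{0, 1 - yi<w,xi>}."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(T):
        w_sum += w
        i = rng.integers(m)                 # sample one example
        v = lam * w                         # gradient of (lambda/2)||w||^2
        if y[i] * np.dot(w, X[i]) < 1:      # hinge term is active
            v -= y[i] * X[i]                # subgradient of max{0, 1 - yi<w,xi>}
        w = w - eta * v
    return w_sum / T                        # averaged iterate barw
```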
Hard-SVM dual formulation