Support Vector Machines Flashcards
Define a linear classifier margin
the width by which the boundary could be increased before touching a datapoint
What is the margin of the Linear SVM
The maximum margin: the linear SVM chooses the separating hyperplane with the largest possible margin
What are support vectors
The datapoints that the margin pushes up against
Advantages of Maximum Margin
- if the boundary is slightly misplaced, this gives us the least chance of misclassification
- Empirically, this works very well
- the model is immune to the removal of any non-support-vector data points
The hyperplane w ⋅ x + b = 0 is fully determined by ?
(w, b)
w = Weight Vector, b = bias term
w is perpendicular to the plus and minus planes; how does this help us calculate the margin width?
A point x- on the minus plane reaches the closest point x+ on the plus plane by moving along w, i.e. by some multiple λ of w:
w ⋅ x+ + b = +1
w ⋅ x- + b = -1
x+ = x- + λw
Subtracting the two plane equations gives w ⋅ (x+ - x-) = 2, so λ||w||² = 2 and λ = 2 / ||w||²
The margin is the distance between the planes:
M = |x+ - x-| = λ||w|| = 2 / ||w||
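As a quick numerical check of M = 2 / ||w||, here is a minimal scikit-learn sketch; the toy data and the very large C (used to approximate a hard margin) are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (toy data for illustration)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Hard margin approximated by a linear SVM with a very large C
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]           # weight vector w
b = clf.intercept_[0]      # bias term b
print("margin width M =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```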
Problems with Maximum Margin
- The solution can change drastically if there is an outlier
- no solution if the classes are not linearly separable
What is the general idea of the Soft Margin SVM
- “Relax” the formulation to allow points to be on the “wrong” side.
- Penalize points according to how far they are on the wrong side
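In scikit-learn the soft-margin penalty is the C parameter of SVC: a small C tolerates points on the wrong side, a large C punishes them heavily. A minimal sketch, with the toy data (including one mislabelled outlier) assumed for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Two clusters plus one +1 point sitting inside the -1 cluster (an outlier)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0],
              [1.5, 1.5]])
y = np.array([-1, -1, -1, 1, 1, 1, 1])

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C}: margin width {margin:.3f}, {len(clf.support_)} support vectors")
```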
How well does Soft Margin SVM do on unseen data?
- Depends on the training error and the number of support vectors
- When the number of support vectors is small, the generalization error is unlikely to be much higher than the training error
What is a vector
an object that has both a magnitude and a direction
what is a vector’s norm
the magnitude, or length, of a vector.
Denoted ||x||
What is a vector norm and how is it computed
It is the magnitude, or length, of a vector.
Calculated using the Euclidean norm formula:
||x|| = √(x1² + x2² + … + xn²), the square root of the sum of the squared components
Give the vector that denotes the direction of a vector
For a 2-D vector x, its direction is x / ||x|| = (cos(θ), cos(α)), where θ and α are the angles x makes with the two coordinate axes
Dot product of n-dimensional vectors (formula)
x ⋅ y = SUM(i=1, n) xi yi
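A tiny numpy illustration of the norm, direction, and dot-product definitions above (the example vectors are arbitrary):

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([1.0, 2.0])

# Euclidean norm: square root of the sum of the squared components
print(np.linalg.norm(x))        # 5.0, since sqrt(3^2 + 4^2) = 5

# Direction of x: the unit vector of direction cosines
print(x / np.linalg.norm(x))    # [0.6, 0.8]

# Dot product: sum over i of x_i * y_i
print(np.dot(x, y))             # 3*1 + 4*2 = 11
```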
What separates data in a) one dimension b) two dimensions c) three dimensions
a) a point
b) a line
c) a plane
Describe the kernel trick used in SVM and what is the mathematical reasoning that makes it work?
Implicitly applying a transformation φ to all data points before running the algorithm, so that non-linear patterns become linearly separable; the algorithm only ever needs dot products φ(xi) ⋅ φ(xj), which the kernel computes directly
A function K(⋅, ⋅) is a kernel if there exists a function φ(⋅) s.t.
K(xi, xj) = φ(xi) ⋅ φ(xj)
Why is the kernel trick important?
We get the advantage of using a large set of non-linear features without paying the computational price of computing them explicitly
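A concrete way to see this: for the degree-2 polynomial kernel K(x, y) = (x ⋅ y)², an explicit feature map φ exists (the 3-D map sketched below is a standard textbook choice, used here only for illustration), and the kernel gives the same value without ever forming φ.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (no bias term)."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def poly_kernel(a, b):
    """Degree-2 polynomial kernel K(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

print(poly_kernel(x, y))        # (1*3 + 2*4)^2 = 121.0, computed in 2-D
print(np.dot(phi(x), phi(y)))   # same 121.0, but via the explicit 3-D features
```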
What do SVMs learn and what approach is this called
SVMs learn the discrimination boundary
This is called a discriminative approach
What are the 3 key ideas of SVMs
- Use optimisation to find a solution with few errors
- Seek a large-margin separator to improve generalisation
- Use the kernel trick to make large feature spaces computationally tractable
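Putting the three ideas together, a minimal scikit-learn sketch (the synthetic circles data and the hyperparameter values are assumptions for illustration): a soft-margin SVM with an RBF kernel separates classes that no linear boundary could.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Soft margin (C) + kernel trick (RBF) in one estimator
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy:", clf.score(X, y))
print("number of support vectors:", len(clf.support_))
```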