Course Session 7 (Support vector machines) Flashcards
What is a support vector machine?
A linear classifier trained with supervised learning. The key idea is the maximal margin: choose the separating hyperplane so that the two parallel lines bounding the margin are as far apart as possible without either side making a classification error.
What is a support vector?
The support vectors are the data points that lie on the margin edges, i.e., on the two parallel lines bounding the margin of a support vector machine.
In the linearly separable case, what does the support vector algorithm do?
It looks for the separating hyperplane with the largest margin.
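A minimal sketch of this, assuming scikit-learn and a made-up toy dataset: a linear SVC with a very large C approximates the hard-margin solution, and the margin width can be read off as 2/||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (hypothetical toy data).
X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
              [3.0, 3.0], [4.0, 3.5], [3.5, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin case (no violations tolerated).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w:", w, "b:", b)
print("support vectors:\n", clf.support_vectors_)  # points on the margin edges
print("margin width:", 2.0 / np.linalg.norm(w))    # geometric margin = 2 / ||w||
```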
What are the pros and cons of support vector machines?
Pros:
* Works very well in practice.
* Nearly foolproof: there are only a few kernels to choose from and very few hand-tuned parameters.
Cons:
* Slow training, because of the quadratic programming problem.
* Slow testing, which depends on the number of support vectors (data points on the margin).
* Curse of sample size: the training cost grows rapidly with the number of training samples.
What methods can handle non-linearly-separable data for support vector machines?
Nonlinear classifiers, or explicitly increasing the dimension of the feature space (illustrated in the sketch after this list)
* Potentially complicated
The kernel trick
* Only applicable to algorithms that can be written entirely in terms of inner (dot) products between vectors, which is why it pairs naturally with support vector machines
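A hedged illustration of the dimension-increase idea (the 1-D data below is made up): class +1 sits between the two class -1 groups, so no single threshold on x separates them, but the map x -> (x, x^2) makes the set linearly separable in 2-D.

```python
import numpy as np

# 1-D data: class +1 lies between the two class -1 groups,
# so no single threshold on x can separate the classes.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Map x -> (x, x^2): in 2-D a horizontal line x^2 = const now
# splits the classes, so a linear classifier suffices there.
phi = np.column_stack([x, x ** 2])
print(phi)
```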
What is the kernel trick?
Training vectors x_i are mapped into a higher- (possibly infinite-) dimensional space by a function φ. The SVM then finds a linear separating hyperplane in this higher-dimensional space.
The kernel is
K(x_i, x_j) = φ(x_i)^T φ(x_j)
Substitute K(x_i, x) for x_i^T x in the decision function, so φ never needs to be evaluated explicitly.
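A quick numeric check of the trick (toy vectors; the degree-2 polynomial kernel on R^2 is an assumed example): evaluating K(a, b) = (a^T b)^2 in the original space gives exactly the inner product of the explicit feature map φ(a) = (a1^2, √2·a1·a2, a2^2).

```python
import numpy as np

def phi(a):
    # Explicit feature map for the degree-2 polynomial kernel on R^2.
    return np.array([a[0] ** 2, np.sqrt(2) * a[0] * a[1], a[1] ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

lhs = (a @ b) ** 2      # kernel evaluated in the original 2-D space
rhs = phi(a) @ phi(b)   # inner product in the mapped 3-D space
print(lhs, rhs)         # identical: phi is never built during SVM training
```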
What are the basic kernels for the kernel trick?
Linear:
K(x_i, x_j) = x_i^T x_j
Polynomial:
K(x_i, x_j) = (γ x_i^T x_j + r)^d, γ > 0
Radial basis function:
K(x_i, x_j) = exp(-γ ||x_i - x_j||^2), γ > 0
Sigmoid:
K(x_i, x_j) = tanh(γ x_i^T x_j + r)
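A sketch of these four kernels in NumPy; the values of γ, r, and d below are arbitrary example choices, not recommendations.

```python
import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, gamma=1.0, r=0.0, d=3):
    return (gamma * (xi @ xj) + r) ** d

def rbf(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def sigmoid(xi, xj, gamma=0.1, r=0.0):
    return np.tanh(gamma * (xi @ xj) + r)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear(xi, xj), polynomial(xi, xj), rbf(xi, xj), sigmoid(xi, xj))
```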
Write out the decision function of the support vector machine and explain it.
It is the function that maps an input vector to a signed score; the sign of the score gives the predicted class, and the zero level set of f is the decision boundary, a hyperplane. Training chooses this hyperplane to maximize the margin between the classes while minimizing classification errors.
f(x) = w^T x + b = \sum_{i=1}^N α_i y_i x_i^T x + b
x: The input feature vector for which the prediction is being made.
Example: A data point from the test set.
w: The weight vector that defines the orientation of the hyperplane in the feature space.
In the dual form, w is expressed as a weighted sum of the support vectors.
b: The bias term (or offset).
It determines the position of the hyperplane relative to the origin.
α_i: The Lagrange multipliers (or dual coefficients).
These are determined during the optimization process in the dual form. Nonzero values of α_i correspond to the support vectors, the data points closest to the hyperplane.
y_i: The label of the i-th training sample.
Typically, y_i ∈ {−1, +1} for binary classification.
x_i^T * x: The dot product between the training sample x_i and the input vector x. In the kernelized version of SVMs, this term is replaced by a kernel function K(x_i,x).
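A hedged sketch, assuming scikit-learn: a fitted SVC stores the products α_i·y_i in dual_coef_ and the corresponding support vectors in support_vectors_, so the dual-form f(x) above can be reproduced by hand and compared against decision_function.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

x_new = X[0]
K = clf.support_vectors_ @ x_new               # x_i^T x for each support vector
f = clf.dual_coef_[0] @ K + clf.intercept_[0]  # sum_i alpha_i y_i x_i^T x + b
print(f, clf.decision_function([x_new])[0])    # the two values agree
```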
What is a soft-margin support vector machine?
If the data is not separable, we introduce slack variables ξ_i (measuring violations of the margin constraints) and a penalty term for them in the objective:
min_{w,b,ξ_1,...,ξ_N} (1/2) w^T w + C \sum_{i=1}^N ξ_i,
such that for all i, y_i(w^T φ(x_i) + b) ≥ 1 - ξ_i and ξ_i ≥ 0.
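A minimal sketch, assuming scikit-learn and synthetic overlapping blobs: the penalty weight C multiplies the slack sum, so a small C tolerates many margin violations while a large C penalizes them heavily.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Overlapping clusters, so the data is not linearly separable (toy example).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=3.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # More support vectors typically means more tolerated margin violations.
    print(f"C={C}: {len(clf.support_vectors_)} support vectors")
```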