Week 8: Support Vector Machines Flashcards
Support Vector Machines (SVMs)
They operate similarly to linear machines with margins. SVMs rely on preprocessing that maps the data into a high-dimensional space using kernel functions. A model is not trained iteratively; instead, the optimal weights are calculated directly by solving a constrained optimisation problem. They are based on Linear Discriminant Functions and operate similarly to Neural Networks.
Linear SVMs
The data must be linearly separable, and the basic formulation handles exactly 2 classes. The separating hyperplane is f(x) = w^Tx + w_0 = 0. The optimal margin is such that w^Tx + w_0 \ge 1 \; \forall x \in class 1 (+1) and w^Tx + w_0 \le -1 \; \forall x \in class 2 (-1).
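A minimal sketch, assuming NumPy and hand-picked example values for w and w_0 (not from the notes), of evaluating the decision function and reading off the class:

```python
import numpy as np

# Hypothetical hyperplane parameters (illustration only)
w = np.array([2.0, -1.0])
w0 = 0.5

def f(x):
    """Decision function f(x) = w^T x + w_0."""
    return w @ x + w0

x = np.array([1.0, 0.25])
print(f(x))                      # signed value; |f(x)| >= 1 for points outside the margin band
print(+1 if f(x) >= 0 else -1)   # predicted class label
```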
Linearly Separable Case
When samples of different classes can be separated by linear functions.
Margin of Hyperplane
The (signed) distance of a point from the hyperplane is z = f(x)/||w||. The exact margin can be calculated using a support vector and the hyperplane weights: a support vector satisfies |f(x)| = 1, so the margin on each side is 1/||w|| and the total margin width is 2/||w||.
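A short sketch of these two formulas, with assumed values for w and w_0 and a point chosen so that it lies on the margin:

```python
import numpy as np

# Hypothetical hyperplane and a point lying on the margin (illustration only)
w = np.array([2.0, -1.0])
w0 = 0.5
x_sv = np.array([0.25, 0.0])   # chosen so that w^T x + w0 = 1

signed_dist = (w @ x_sv + w0) / np.linalg.norm(w)   # z = f(x)/||w||
margin_width = 2.0 / np.linalg.norm(w)              # total width of the margin band
print(signed_dist, margin_width)
```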
Lagrange Multiplier Method
An approach for solving for the weights of the separating hyperplane in the linearly separable case. \mathcal{L}(w, w_0, \lambda) = (1/2)||w||^2 - \sum_{i=1}^N \lambda_i (y_i (w^Tx_i + w_0) - 1), with \lambda_i \ge 0, \forall i=1,2,…,N
The following dual problem is set up:
\underset{\lambda \ge 0}{\max} \underset{w, w_0}{\min} ((1/2)||w||^2 - \sum_{i=1}^N \lambda_i (y_i (w^Tx_i + w_0) - 1)).
Setting the following partial derivatives to zero and solving gives:
\frac{\partial \mathcal{L} (w, w_0, \lambda)}{\partial w} = 0 -> w = \sum_{i=1}^N \lambda_i y_i x_i
\frac{\partial \mathcal{L} (w, w_0, \lambda)}{\partial w_0} = 0 -> \sum_{i=1}^N \lambda_i y_i = 0
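A small sketch of using these stationarity conditions to recover w and w_0, assuming NumPy and a two-point toy set; the multiplier values are assumed here (though for this particular toy set they happen to be the optimal ones):

```python
import numpy as np

# Toy data: one sample per class, plus assumed Lagrange multipliers
X = np.array([[2.0, 2.0], [1.0, 1.0]])
y = np.array([+1.0, -1.0])
lam = np.array([1.0, 1.0])

# Stationarity condition: w = sum_i lambda_i y_i x_i
w = (lam * y) @ X

# w_0 from any support vector (lambda_i > 0): y_i (w^T x_i + w_0) = 1
i = 0
w0 = y[i] - w @ X[i]
print(w, w0)
```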
Non-Separable Case
This is the case where the data is not linearly separable. Assume there are 2 classes. The input samples fall into 3 categories:
- Samples that fall outside the band and are correctly classified. y_i(w^Tx_i + w_0) \ge 1
- Samples that fall inside the band and are correctly classified. 0 \le y_i(w^Tx_i + w_0) < 1
- Samples that are misclassified. y_i(w^Tx_i + w_0) < 0
Let’s introduce a slack variable \xi_i. Then y_i(w^Tx_i + w_0) \ge 1 - \xi_i
- Samples that fall outside the band and are correctly classified. \xi_i=0
- Samples that fall inside the band and are correctly classified. 0 < \xi_i \le 1
- Samples that are misclassified. \xi_i > 1
For the second and third categories the slack is obtained as \xi_i = 1 - y_i(w^Tx_i + w_0). We want to maximise the margin and minimise the number of misclassified points (minimise margin violations). As such, \underset{w, w_0, \xi}{\min} J(w, \xi) = (1/2)||w||^2 + C \sum_{i=1}^N \xi_i, such that y_i(w^Tx_i + w_0) \ge 1 - \xi_i, i = 1,2,…,N, and \xi_i \ge 0, i = 1,2,…,N, where 0 \le C \le +\infty is a pre-set constant scalar that controls the influence of the 2 competing terms.
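A brief sketch of evaluating this objective, assuming NumPy, a toy data set and hand-picked w, w_0 and C; the slacks are computed as \xi_i = \max(0, 1 - y_i f(x_i)), which reproduces the three categories above:

```python
import numpy as np

# Hypothetical data, weights and C (illustration only)
X = np.array([[2.0, 2.0], [1.0, 1.0], [1.5, 1.4]])
y = np.array([+1.0, -1.0, -1.0])
w, w0, C = np.array([1.0, 1.0]), -3.0, 1.0

f = X @ w + w0                      # decision values
xi = np.maximum(0.0, 1.0 - y * f)   # slack: 0 outside the band, (0,1] inside, >1 if misclassified
J = 0.5 * w @ w + C * xi.sum()      # soft-margin objective
print(xi, J)
```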
The Lagrangian of the primal problem is thus \mathcal{L}(w,w_0,\xi,\lambda,\mu) = (1/2)||w||^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \mu_i \xi_i - \sum_{i=1}^N \lambda_i(y_i(w^Tx_i + w_0) - 1 + \xi_i), with \mu_i \ge 0, \lambda_i \ge 0. The dual problem is \underset{\lambda \ge 0, \mu \ge 0}{\max} \underset{w, w_0, \xi}{\min} \mathcal{L}(w, w_0, \xi, \lambda, \mu). Take the partial derivatives with respect to w, w_0, and \xi_i, set them to 0, and solve the resulting system of equations.
After eliminating w, w_0 and \xi, the dual problem reduces to \underset{\lambda}{\max} (\sum_{i=1}^N \lambda_i - (1/2)\sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j x_i^T x_j), subject to 0 \le \lambda_i \le C and \sum_{i=1}^N \lambda_i y_i = 0.
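A sketch of solving this dual numerically, assuming SciPy's general-purpose SLSQP solver on a tiny toy set (in practice a dedicated QP solver would be used); the data, C and tolerances are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data and an assumed C (illustration only)
X = np.array([[2.0, 2.0], [1.0, 1.0], [0.0, 0.5]])
y = np.array([+1.0, -1.0, -1.0])
C = 10.0
G = (y[:, None] * X) @ (y[:, None] * X).T            # G_ij = y_i y_j x_i^T x_j

def neg_dual(lam):
    # Negated dual objective, since scipy minimises and the dual is maximised
    return -(lam.sum() - 0.5 * lam @ G @ lam)

cons = {'type': 'eq', 'fun': lambda lam: lam @ y}     # sum_i lambda_i y_i = 0
bounds = [(0.0, C)] * len(y)                          # 0 <= lambda_i <= C
res = minimize(neg_dual, x0=np.zeros(len(y)), method='SLSQP',
               bounds=bounds, constraints=cons)
lam = res.x

w = (lam * y) @ X                                     # w = sum_i lambda_i y_i x_i
margin_svs = np.where((lam > 1e-6) & (lam < C - 1e-6))[0]
w0 = y[margin_svs[0]] - w @ X[margin_svs[0]]          # from y_i(w^T x_i + w_0) = 1
print(lam, w, w0)
```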
Nonlinear SVMs
We can perform transformations using a feature mapping: z_i = \Phi(x_i). The resulting transformation can alter the feature space so that the data becomes linearly separable. The decision boundary is then \sum_{i \in SVs}\lambda_i y_i \Phi(x_i)^T \Phi(x) + w_0 = 0
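A minimal sketch of the idea, assuming NumPy, a made-up 2-D data set (an inner cluster versus an outer ring) and an assumed feature map \Phi(x) = (x_1, x_2, x_1^2 + x_2^2); the added radius-squared coordinate makes the classes linearly separable:

```python
import numpy as np

# Toy 2-D data: class +1 clustered near the origin, class -1 on an outer ring
rng = np.random.default_rng(0)
inner = rng.normal(scale=0.3, size=(20, 2))
outer = rng.normal(size=(20, 2))
outer = 2.5 * outer / np.linalg.norm(outer, axis=1, keepdims=True)
X = np.vstack([inner, outer])
y = np.hstack([np.ones(20), -np.ones(20)])

# Assumed feature map Phi(x) = (x1, x2, x1^2 + x2^2)
Z = np.column_stack([X, (X ** 2).sum(axis=1)])

# In the mapped space a hyperplane on the third coordinate separates the classes,
# e.g. z_3 = 1.5 for this toy data (should print True)
print(np.all((Z[:, 2] < 1.5) == (y == 1)))
```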
Hard Classifier
f(x) = sgn(w^Tx + w_0). Note that x can be replaced by its feature-space image \Phi(x), with w defined in that space.
Soft Classifier
f(x) = h(w^Tx + w_0), where h(\cdot) returns a graded value rather than a hard label. Note that x can be replaced by its feature-space image \Phi(x), with w defined in that space.
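A small sketch contrasting the two, assuming NumPy, hand-picked weights, and a piecewise-linear choice for h (one common option; the notes do not fix a particular h):

```python
import numpy as np

# Hypothetical weights (illustration only)
w, w0 = np.array([1.0, 1.0]), -3.0

def hard_classifier(x):
    """Hard classifier: returns a label in {-1, +1}."""
    return np.sign(w @ x + w0)

def soft_classifier(x):
    """Soft classifier with an assumed piecewise-linear h: graded output in [-1, 1]."""
    return np.clip(w @ x + w0, -1.0, 1.0)

x = np.array([1.8, 1.5])
print(hard_classifier(x), soft_classifier(x))
```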
Kernel-based Nonlinear SVMs
Kernels that satisfy Mercer's Theorem implicitly map the data to a high-dimensional feature space: K(x_i, x) = \Phi(x_i)^T \Phi(x). Inner products in the feature space are computed directly from the original inputs, which reduces the amount of computation.
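A quick sketch of this identity for the homogeneous degree-2 polynomial kernel in 2-D, K(a, b) = (a^Tb)^2, whose explicit feature map is \Phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2) (a standard result, shown here only for illustration):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel in 2-D."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def K(a, b):
    """Kernel computed directly in the input space: K(a, b) = (a^T b)^2."""
    return (a @ b) ** 2

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(a, b), phi(a) @ phi(b))   # both give the same value; Phi is never built explicitly
```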
Linear Kernel
K(x_i, x) = x_i^Tx+c, with c as an optional constant
Polynomial Kernel of Degree q
K(x_i, x) = (\alpha x_i^T x + c)^q, with q > 0
Radial Basis Function (RBF)
K(x_i, x)= e^{-\frac{||x_i - x||^2}{2 \sigma^2}}
Multi-quadratic Kernel
K(x_i, x) = \sqrt{||x_i - x||^2 + c}
Inverse Multi-quadratic Kernel
K(x_i, x) = \frac{1}{\sqrt{||x_i - x||^2 + c}}
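A compact sketch of the kernels listed above as NumPy functions; the hyperparameter defaults (c, \alpha, q, \sigma) are arbitrary illustrative choices:

```python
import numpy as np

# Direct NumPy versions of the kernels above; default hyperparameters are assumptions
def linear(xi, x, c=0.0):
    return xi @ x + c

def polynomial(xi, x, alpha=1.0, c=1.0, q=3):
    return (alpha * (xi @ x) + c) ** q

def rbf(xi, x, sigma=1.0):
    return np.exp(-np.sum((xi - x) ** 2) / (2 * sigma ** 2))

def multiquadratic(xi, x, c=1.0):
    return np.sqrt(np.sum((xi - x) ** 2) + c)

def inverse_multiquadratic(xi, x, c=1.0):
    return 1.0 / np.sqrt(np.sum((xi - x) ** 2) + c)

xi, x = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear, polynomial, rbf, multiquadratic, inverse_multiquadratic):
    print(k.__name__, k(xi, x))
```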