Week 8: Support Vector Machines Flashcards
Support Vector Machines (SVMs)
They operate similarly to linear machines with margins. SVMs rely on preprocessing the data in a high-dimensional space using kernel functions. A model isn’t trained iteratively; instead, the optimal weights are calculated directly. They’re based on Linear Discriminant Functions and operate similarly to Neural Networks.
Linear SVMs
The data must be linearly separable and there must be exactly 2 classes. The separating hyperplane is f(x) = w^Tx + w_0 = 0. The optimal margin is such that w^Tx + w_0 \ge 1 \ \forall x \in class 1 (+1) and w^Tx + w_0 \le -1 \ \forall x \in class 2 (-1).
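As a concrete illustration (a minimal sketch assuming scikit-learn and NumPy, which the notes don’t prescribe; the toy data and the very large C used to approximate a hard margin are arbitrary choices):
```python
# Minimal linear SVM on a toy separable 2-class problem.
# A very large C approximates the hard-margin case.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 0.5],          # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, w_0 = clf.coef_[0], clf.intercept_[0]

# Support vectors satisfy y_i (w^T x_i + w_0) ~ 1; all other samples give values > 1.
print(w, w_0)
print(y * (X @ w + w_0))
```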
Linearly Separable Case
When samples of different classes can be separated by linear functions.
Margin of Hyperplane
The distance of a point from the hyperplane is z = f(x)/||w||. Since a support vector satisfies f(x) = \pm 1, the margin can be calculated from a support vector and the hyperplane weights as 1/||w|| (the full band has width 2/||w||).
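For example (a self-contained NumPy sketch; the values of w, w_0, and x are illustrative, not taken from the notes):
```python
# Distance of a point from the hyperplane, z = f(x) / ||w||.
import numpy as np

w = np.array([2.0, 1.0])
w_0 = -1.0
x = np.array([2.0, 3.0])

f_x = w @ x + w_0                 # f(x) = w^T x + w_0 = 6
z = f_x / np.linalg.norm(w)       # signed distance = 6 / sqrt(5) ~ 2.68
print(z)

# A support vector satisfies f(x) = +/-1, so the margin is 1 / ||w||.
print(1.0 / np.linalg.norm(w))
```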
Lagrange Multiplier Method
An approach to solve for the weights of the separating hyperplane in the linearly separable case. \mathcal{L}(w, w_0, \lambda) = (1/2)||w||^2 - \sum_{i=1}^N \lambda_i (y_i (w^Tx_i + w_0) - 1), with \lambda_i \ge 0, \forall i=1,2,…,N
The following dual problem is set up:
\underset{\lambda \ge 0}{\max} \underset{w, w_0}{\min} ((1/2)||w||^2 - \sum_{i=1}^N \lambda_i (y_i (w^Tx_i + w_0) - 1)).
Setting the following partial derivatives to zero and solving gives:
\frac{\partial \mathcal{L} (w, w_0, \lambda)}{\partial w} = 0 -> w = \sum_{i=1}^N \lambda_i y_i x_i
\frac{\partial \mathcal{L} (w, w_0, \lambda)}{\partial w_0} = 0 -> \sum_{i=1}^N \lambda_i y_i = 0
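Substituting w = \sum_{i=1}^N \lambda_i y_i x_i back into \mathcal{L} and using \sum_{i=1}^N \lambda_i y_i = 0 eliminates w and w_0 (a worked step consistent with the formulas above), leaving the dual in terms of \lambda only:
\underset{\lambda \ge 0}{\max} (\sum_{i=1}^N \lambda_i - (1/2)\sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j x_i^T x_j), subject to \sum_{i=1}^N \lambda_i y_i = 0.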
Non-Separable Case
This is when the data isn’t linearly separable. Let’s assume there are 2 classes. There are 3 categories of input samples.
- Samples that fall outside the band and are correctly classified. y_i(w^Tx_i + w_0) \ge 1
- Samples that fall inside the band and are correctly classified. 0 \le y_i(w^Tx_i + w_0) < 1
- Samples that are misclassified. y_i(w^Tx_i + w_0) < 0
Let’s introduce a slack variable \xi_i. Then y_i(w^Tx_i + w_0) \ge 1 - \xi_i
- Samples that fall outside the band and are correctly classified. \xi_i=0
- Samples that fall inside the band and are correctly classified. 0 < \xi_i \le 1
- Samples that are misclassified. \xi_i > 1
For the second and third categories, the slack is obtained as \xi_i = 1 - y_i(w^Tx_i + w_0). We want to maximise the margin and minimise the number of misclassified points (minimise margin violations). As such, \underset{w, w_0, \xi}{\min} J(w, \xi) = (1/2)||w||^2 + C \sum_{i=1}^N \xi_i, such that y_i(w^Tx_i + w_0) \ge 1 - \xi_i, i = 1,2,…,N, \xi_i \ge 0, i = 1,2,…,N, and 0 \le C \le +\infty is a pre-set constant scalar that controls the trade-off between the two competing terms.
The Primal problem is thus \mathcal{L}(w,w_0,\xi,\lambda,\mu) = (1/2)||w||^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \mu_i \xi_i - \sum_{i=1}^N \lambda_i(y_i(w^Tx_i + w_0) - 1 + \xi_i), with \mu_i \ge 0, \lambda_i \ge 0. The dual problem is \underset{\lambda \ge 0, \mu \ge 0}{\max} \underset{w, w_0, \xi}{\min} \mathcal{L}(w, w_0, \xi, \lambda, \mu). Find the partial derivatives with respect to w, w_0, and \xi_i, set them to 0, and solve the resulting system of equations.
After these substitutions, the dual problem reduces to \underset{\lambda}{\max} (\sum_{i=1}^N \lambda_i - (1/2)\sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j x_i^T x_j), subject to \sum_{i=1}^N \lambda_i y_i = 0 and 0 \le \lambda_i \le C, i = 1,2,…,N.
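To see the role of C in practice (a scikit-learn sketch on illustrative noisy data; the specific values of C are arbitrary), a small C tolerates many margin violations while a large C penalises them heavily:
```python
# Effect of the penalty C on an overlapping 2-class problem.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-1, -1], scale=1.0, size=(50, 2)),
               rng.normal(loc=[+1, +1], scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    violations = np.sum(y * (X @ clf.coef_[0] + clf.intercept_[0]) < 1)
    # Small C -> wider margin, more support vectors and slack; large C -> the opposite.
    print(C, len(clf.support_), violations)
```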
Nonlinear SVMs
We can perform transformations using feature mapping: z_i = \Phi(x_i). The resulting transformation can alter the feature space so that the data becomes linearly separable. The decision boundary then becomes \sum_{i \in SVs}\lambda_i y_i \Phi(x_i)^T \Phi(x) + w_0 = 0
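For instance (a NumPy/scikit-learn sketch; the concentric-circles data and the mapping \Phi(x) = [x_1, x_2, x_1^2 + x_2^2] are illustrative choices, not prescribed by the notes):
```python
# An explicit feature mapping makes two concentric circles linearly separable.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=200)
r = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])        # two radii
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([np.zeros(100), np.ones(100)])

Z = np.column_stack([X, (X ** 2).sum(axis=1)])                    # z_i = Phi(x_i)
print(SVC(kernel="linear").fit(X, y).score(X, y))                 # low: not separable in 2-D
print(SVC(kernel="linear").fit(Z, y).score(Z, y))                 # 1.0: separable after mapping
```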
Hard Classifier
f(x) = sgn(w^Tx + w_0). Note that x can be replaced by the transformed features \Phi(x), with w defined in that feature space.
Soft Classifier
f(x) = h(w^Tx + w_0), where h(\cdot) returns a graded (real-valued) output rather than a hard \pm 1 decision. Note that x can be replaced by the transformed features \Phi(x), with w defined in that feature space.
Kernel-based Nonlinear SVMs
Certain kernels that satisfy Mercer’s Theorem can map the data to a high-dimensional feature space implicitly: K(x_i, x) = \Phi(x_i)^T \Phi(x). Because \Phi never has to be computed explicitly, they reduce the amount of computation.
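A quick numerical check (a NumPy sketch; the degree-2 polynomial kernel in 2-D is an illustrative choice) shows the implicit kernel value equals the inner product of explicit feature maps:
```python
# Kernel trick check: K(x, z) = (x^T z)^2 equals Phi(x)^T Phi(z)
# for the explicit map Phi(x) = [x1^2, x2^2, sqrt(2) * x1 * x2].
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print((x @ z) ** 2)        # kernel computed implicitly: (3 - 2)^2 = 1
print(phi(x) @ phi(z))     # same value via the explicit high-dimensional map
```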
Linear Kernel
K(x_i, x) = x_i^Tx+c, with c as an optional constant
Polynomial kernel of Degree q:
K(x_i, x) = (\alpha x_i^T x + c)^q, with q > 0
Radial Basis Function (RBF)
K(x_i, x)= e^{-\frac{||x_i - x||^2}{2 \sigma^2}}
Multi-quadratic Kernel
K(x_i, x) = \sqrt{||x_i - x||^2 + c}
Inverse Multi-quadratic Kernel
K(x_i, x) = \frac{1}{\sqrt{||x_i - x||^2 + c}}
Power Kernel
K(x_i, x) = - ||x_i - x||^q
Log Kernel
K(x_i, x) = - \log(||x_i - x||^q + 1)
Sigmoid Kernel (Hyperbolic Tangent)
K(x_i, x) = \tanh(\beta x_i^T x + \gamma)
Other Kernels
Examples include a linear combination of kernels or a product of kernels.
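For example, scikit-learn’s SVC accepts a callable that returns the Gram matrix, so a combined kernel can be plugged in directly (a sketch; the equal weighting of a linear and an RBF kernel and sigma = 1 are arbitrary choices):
```python
# A linear combination of two kernels (linear + RBF) passed to SVC as a callable.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

def combined_kernel(A, B, sigma=1.0):
    linear = A @ B.T
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    rbf = np.exp(-sq_dists / (2 * sigma ** 2))
    return 0.5 * linear + 0.5 * rbf        # Gram matrix of the combined kernel

X, y = load_iris(return_X_y=True)
clf = SVC(kernel=combined_kernel).fit(X, y)
print(clf.score(X, y))
```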
Multi-class SVM Classifiers
These handle more than 2 classes by combining several 2-class SVM classifiers.
One-against-one
A binary classifier is trained for each pair of classes, giving R(R-1)/2 classifiers in total. A point is classified by majority vote, with all SVMs consulted (see the sketch after the pros and cons below).
Pros:
- Training each SVM is faster since its training set (only 2 classes) is smaller than the whole dataset.
- Classes can be added or removed flexibly. When adding a class, existing SVMs are unaffected and only the SVMs involving the new class need to be trained. When removing a class, the SVMs for that class can simply be removed.
Cons:
- The number of SVMs is larger compared with other approaches.
- Many SVMs have to be evaluated to make the final decision.
- The class of some regions of the feature space can’t be determined.
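A minimal one-against-one sketch (assuming scikit-learn and the iris dataset, neither of which is part of the notes):
```python
# One-against-one: R(R-1)/2 pairwise SVMs, majority vote over all of them.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                 # R = 3 classes
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovo.estimators_))                       # 3 * (3 - 1) / 2 = 3 pairwise SVMs
print(ovo.predict(X[:5]))
```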
One-against-all
For each class, an SVM is trained to separate that class from the rest of the data. There are R SVMs in total, and all of them are consulted (see the sketch after the pros and cons below).
Pros:
- The number of SVMs to be trained is relatively few.
- Fewer SVMs have to be evaluated to make the final decision.
Cons:
- The training time may be longer as the whole dataset is used for training each SVM.
- When adding/removing classes, all SVMs have to be retrained.
- The class of some regions of the feature space can’t be determined.
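A corresponding one-against-all sketch (same assumptions as the one-against-one sketch above):
```python
# One-against-all: R SVMs, each separating one class from the rest.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                 # R = 3 classes
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovr.estimators_))                       # R = 3 SVMs
print(ovr.predict(X[:5]))
```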
Binary Decision Tree
At each node of the binary tree, split the remaining classes in half. This creates R-1 SVMs and consults at most ceil(\log_2 R) of them per decision (see the sketch after the pros and cons below).
Pros:
- The number of SVMs to be trained is relatively few.
- When adding classes, only some SVMs need to be trained. The other existing SVMs can be re-used.
- The number of SVMs that need to be evaluated to make a decision is further reduced.
- Training time is slightly faster since not all SVMs use the whole training dataset. Smaller training sets are used as the algorithm descends the decision tree.
Cons:
- When removing classes, all SVMs need to be re-trained.
- The classification performance heavily depends on the SVMs in the upper levels. Classification errors will propagate.
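A minimal binary-decision-tree sketch (assuming scikit-learn and the iris dataset; grouping class 0 against {1, 2} at the root is an illustrative choice):
```python
# Binary decision tree for R = 3 classes: R - 1 = 2 SVMs are trained and at
# most ceil(log2 3) = 2 are consulted per sample.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
root = SVC(kernel="linear").fit(X, (y > 0).astype(int))   # class 0 vs {1, 2}
leaf = SVC(kernel="linear").fit(X[y > 0], y[y > 0])       # 1 vs 2, smaller training set

def predict(x):
    x = np.asarray(x).reshape(1, -1)
    if root.predict(x)[0] == 0:       # root decides the subtree
        return 0
    return int(leaf.predict(x)[0])    # leaf decides within the subtree

print([predict(x) for x in X[:5]], y[:5])   # the first five iris samples are class 0
```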
Binary Code
Each SVM splits the full collection of classes into two groups according to one bit of a binary code. A total of ceil(\log_2 R) SVMs are trained, and all of them are consulted; each class corresponds to a different combination of SVM outputs (see the sketch after the pros and cons below).
Pros:
- The number of SVMs to be trained is further reduced.
- The number of SVMs to be evaluated to make a decision is further reduced.
Cons:
- It may lead to undefined classes if the number of classes isn’t exactly 2^{number of SVMs}.
- Training time may be longer as the whole dataset is used for training each SVM.
- When adding/removing classes, all SVMs have to be re-trained.
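A minimal binary-code sketch (assuming scikit-learn and NumPy; the 4-class toy data and the assignment of one SVM per bit of the class index are illustrative choices):
```python
# Binary code for R = 4 classes using ceil(log2 4) = 2 SVMs, one per bit.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
centres = np.array([[0, 0], [4, 0], [0, 4], [4, 4]])
X = rng.normal(size=(200, 2)) + np.repeat(centres, 50, axis=0)
y = np.repeat(np.arange(4), 50)

# SVM k predicts bit k of the class label's binary code.
svms = [SVC(kernel="linear").fit(X, (y >> k) & 1) for k in range(2)]

def predict(x):
    x = np.asarray(x).reshape(1, -1)
    bits = [int(svm.predict(x)[0]) for svm in svms]   # all SVMs are consulted
    return bits[0] | (bits[1] << 1)                   # decode the bit combination

print(predict([4.1, 3.9]))   # expected: class 3 (binary 11)
```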