Week 8: Support Vector Machines Flashcards
Support Vector Machines (SVMs)
They operate similarly to linear machines with margins. SVMs rely on preprocessing the data in a high-dimensional space using kernel functions. A model isn’t trained iteratively; instead, the optimal weights are calculated directly. They’re based on Linear Discriminant Functions and operate similarly to Neural Networks.
Linear SVMs
The data must be linearly separable and there must be exactly 2 classes. The separating hyperplane is f(x) = w^Tx + w_0 = 0. The optimal margin is such that w^Tx + w_0 \ge 1 \ \forall x \in class 1 (+1) and w^Tx + w_0 \le -1 \ \forall x \in class 2 (-1).
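As a concrete illustration (a minimal sketch assuming scikit-learn and NumPy, which the notes don’t prescribe; the toy data and the very large C used to approximate a hard margin are arbitrary choices):
```python
# Minimal linear SVM on a toy separable 2-class problem.
# A very large C approximates the hard-margin case.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 0.5],          # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, w_0 = clf.coef_[0], clf.intercept_[0]

# Support vectors satisfy y_i (w^T x_i + w_0) ~ 1; all other samples give values > 1.
print(w, w_0)
print(y * (X @ w + w_0))
```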
Linearly Separable Case
When samples of different classes can be separated by linear functions.
Margin of Hyperplane
The distance of a point from the hyperplane is z = f(x)/||w||. Since a support vector satisfies f(x) = \pm 1, the margin can be calculated from a support vector and the hyperplane weights as 1/||w|| (the full band has width 2/||w||).
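For example (a self-contained NumPy sketch; the values of w, w_0, and x are illustrative, not taken from the notes):
```python
# Distance of a point from the hyperplane, z = f(x) / ||w||.
import numpy as np

w = np.array([2.0, 1.0])
w_0 = -1.0
x = np.array([2.0, 3.0])

f_x = w @ x + w_0                 # f(x) = w^T x + w_0 = 6
z = f_x / np.linalg.norm(w)       # signed distance = 6 / sqrt(5) ~ 2.68
print(z)

# A support vector satisfies f(x) = +/-1, so the margin is 1 / ||w||.
print(1.0 / np.linalg.norm(w))
```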
Lagrange Multiplier Method
An approach to solve for the weights of the separating hyperplane in the linearly separable case. \mathcal{L}(w, w_0, \lambda) = (1/2)||w||^2 - \sum_{i=1}^N \lambda_i (y_i (w^Tx_i + w_0) - 1), with \lambda_i \ge 0, \forall i=1,2,…,N
The following dual problem is set up:
\underset{\lambda \ge 0}{\max} \underset{w, w_0}{\min} ((1/2)||w||^2 - \sum_{i=1}^N \lambda_i (y_i (w^Tx_i + w_0) - 1)).
Setting the following partial derivatives to zero and solving gives:
\frac{\partial \mathcal{L} (w, w_0, \lambda)}{\partial w} = 0 -> w = \sum_{i=1}^N \lambda_i y_i x_i
\frac{\partial \mathcal{L} (w, w_0, \lambda)}{\partial w_0} = 0 -> \sum_{i=1}^N \lambda_i y_i = 0
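Substituting w = \sum_{i=1}^N \lambda_i y_i x_i back into \mathcal{L} and using \sum_{i=1}^N \lambda_i y_i = 0 eliminates w and w_0 (a worked step consistent with the formulas above), leaving the dual in terms of \lambda only:
\underset{\lambda \ge 0}{\max} (\sum_{i=1}^N \lambda_i - (1/2)\sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j x_i^T x_j), subject to \sum_{i=1}^N \lambda_i y_i = 0.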
Non-Separable Case
This is when the data isn’t linearly separable. Let’s assume there are 2 classes. There are 3 categories of input samples.
- Samples that fall outside the band and are correctly classified. y_i(w^Tx_i + w_0) \ge 1
- Samples that fall inside the band and are correctly classified. 0 \le y_i(w^Tx_i + w_0) < 1
- Samples that are misclassified. y_i(w^Tx_i + w_0) < 0
Let’s introduce a slack variable \xi_i. Then y_i(w^Tx_i + w_0) \ge 1 - \xi_i
- Samples that fall outside the band and are correctly classified. \xi_i=0
- Samples that fall inside the band and are correctly classified. 0 < \xi_i \le 1
- Samples that are misclassified. \xi_i > 1
For the second and third categories, the slack is obtained as \xi_i = 1 - y_i(w^Tx_i + w_0). We want to maximise the margin and minimise the number of misclassified points (minimise margin violations). As such, \underset{w, w_0, \xi}{\min} J(w, \xi) = (1/2)||w||^2 + C \sum_{i=1}^N \xi_i, such that y_i(w^Tx_i + w_0) \ge 1 - \xi_i, i = 1,2,…,N, \xi_i \ge 0, i = 1,2,…,N, and 0 \le C \le +\infty is a pre-set constant scalar that controls the trade-off between the two competing terms.
The Primal problem is thus \mathcal{L}(w,w_0,\xi,\lambda,\mu) = (1/2)||w||^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \mu_i \xi_i - \sum_{i=1}^N \lambda_i(y_i(w^Tx_i + w_0) - 1 + \xi_i), with \mu_i \ge 0, \lambda_i \ge 0. The dual problem is \underset{\lambda \ge 0, \mu \ge 0}{\max} \underset{w, w_0, \xi}{\min} \mathcal{L}(w, w_0, \xi, \lambda, \mu). Find the partial derivatives with respect to w, w_0, and \xi_i, set them to 0, and solve the resulting system of equations.
After these substitutions, the dual problem reduces to \underset{\lambda}{\max} (\sum_{i=1}^N \lambda_i - (1/2)\sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j x_i^T x_j), subject to \sum_{i=1}^N \lambda_i y_i = 0 and 0 \le \lambda_i \le C, i = 1,2,…,N.
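To see the role of C in practice (a scikit-learn sketch on illustrative noisy data; the specific values of C are arbitrary), a small C tolerates many margin violations while a large C penalises them heavily:
```python
# Effect of the penalty C on an overlapping 2-class problem.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-1, -1], scale=1.0, size=(50, 2)),
               rng.normal(loc=[+1, +1], scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    violations = np.sum(y * (X @ clf.coef_[0] + clf.intercept_[0]) < 1)
    # Small C -> wider margin, more support vectors and slack; large C -> the opposite.
    print(C, len(clf.support_), violations)
```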
Nonlinear SVMs
We can perform transformations using feature mapping: z_i = \Phi(x_i). The resulting transformation can alter the feature space so that the data becomes linearly separable. The decision boundary then becomes \sum_{i \in SVs}\lambda_i y_i \Phi(x_i)^T \Phi(x) + w_0 = 0
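For instance (a NumPy/scikit-learn sketch; the concentric-circles data and the mapping \Phi(x) = [x_1, x_2, x_1^2 + x_2^2] are illustrative choices, not prescribed by the notes):
```python
# An explicit feature mapping makes two concentric circles linearly separable.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=200)
r = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])        # two radii
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([np.zeros(100), np.ones(100)])

Z = np.column_stack([X, (X ** 2).sum(axis=1)])                    # z_i = Phi(x_i)
print(SVC(kernel="linear").fit(X, y).score(X, y))                 # low: not separable in 2-D
print(SVC(kernel="linear").fit(Z, y).score(Z, y))                 # 1.0: separable after mapping
```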
Hard Classifier
f(x) = sgn(w^Tx + w_0). Note that x can be replaced by the transformed features \Phi(x), with w defined in that feature space.
Soft Classifier
f(x) = h(w^Tx + w_0), where h(\cdot) returns a graded (real-valued) output rather than a hard \pm 1 decision. Note that x can be replaced by the transformed features \Phi(x), with w defined in that feature space.
Kernel-based Nonlinear SVMs
Certain kernels that satisfy Mercer’s Theorem can map the data to a high-dimensional feature space implicitly: K(x_i, x) = \Phi(x_i)^T \Phi(x). Because \Phi never has to be computed explicitly, they reduce the amount of computation.
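A quick numerical check (a NumPy sketch; the degree-2 polynomial kernel in 2-D is an illustrative choice) shows the implicit kernel value equals the inner product of explicit feature maps:
```python
# Kernel trick check: K(x, z) = (x^T z)^2 equals Phi(x)^T Phi(z)
# for the explicit map Phi(x) = [x1^2, x2^2, sqrt(2) * x1 * x2].
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print((x @ z) ** 2)        # kernel computed implicitly: (3 - 2)^2 = 1
print(phi(x) @ phi(z))     # same value via the explicit high-dimensional map
```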
Linear Kernel
K(x_i, x) = x_i^Tx+c, with c as an optional constant
Polynomial kernel of Degree q:
K(x_i, x) = (\alpha x_i^T x + c)^q, with q > 0
Radial Basis Function (RBF)
K(x_i, x)= e^{-\frac{||x_i - x||^2}{2 \sigma^2}}
Multi-quadratic Kernel
K(x_i, x) = \sqrt{||x_i - x||^2 + c}
Inverse Multi-quadratic Kernel
K(x_i, x) = \frac{1}{\sqrt{||x_i - x||^2 + c}}
Power Kernel
K(x_i, x) = - ||x_i - x||^q
Log Kernel
K(x_i, x) = - \log(||x_i - x||^q + 1)
Sigmoid Kernel (Hyperbolic Tangent)
K(x_i, x) = \tanh(\beta x_i^T x + \gamma)
Other Kernels
Examples include a linear combination of kernels or a product of kernels.
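For example, scikit-learn’s SVC accepts a callable that returns the Gram matrix, so a combined kernel can be plugged in directly (a sketch; the equal weighting of a linear and an RBF kernel and sigma = 1 are arbitrary choices):
```python
# A linear combination of two kernels (linear + RBF) passed to SVC as a callable.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

def combined_kernel(A, B, sigma=1.0):
    linear = A @ B.T
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    rbf = np.exp(-sq_dists / (2 * sigma ** 2))
    return 0.5 * linear + 0.5 * rbf        # Gram matrix of the combined kernel

X, y = load_iris(return_X_y=True)
clf = SVC(kernel=combined_kernel).fit(X, y)
print(clf.score(X, y))
```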
Multi-class SVM Classifiers
These handle more than 2 classes by combining several 2-class SVM classifiers.
One-against-one
A binary classifier is trained for each pair of classes, giving R(R-1)/2 classifiers in total. A point is classified by majority vote, with all SVMs consulted (see the sketch after the pros and cons below).
Pros:
- Training each SVM is faster since its training set (only 2 classes) is smaller than the whole dataset.
- Classes can be added or removed flexibly. When adding a class, existing SVMs are unaffected and only the SVMs involving the new class need to be trained. When removing a class, the SVMs for that class can simply be removed.
Cons:
- The number of SVMs is larger compared with other approaches.
- Many SVMs have to be evaluated to make the final decision.
- The class of some regions of the feature space can’t be determined.
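A minimal one-against-one sketch (assuming scikit-learn and the iris dataset, neither of which is part of the notes):
```python
# One-against-one: R(R-1)/2 pairwise SVMs, majority vote over all of them.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                 # R = 3 classes
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovo.estimators_))                       # 3 * (3 - 1) / 2 = 3 pairwise SVMs
print(ovo.predict(X[:5]))
```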
One-against-all
For each class, an SVM is trained to separate that class from the rest of the data. There are R SVMs in total, and all of them are consulted (see the sketch after the pros and cons below).
Pros:
- The number of SVMs to be trained is relatively few.
- Fewer SVMs have to be evaluated to make the final decision.
Cons:
- The training time may be longer as the whole dataset is used for training each SVM.
- When adding/removing classes, all SVMs have to be retrained.
- The class of some regions of the feature space can’t be determined.
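A corresponding one-against-all sketch (same assumptions as the one-against-one sketch above):
```python
# One-against-all: R SVMs, each separating one class from the rest.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                 # R = 3 classes
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovr.estimators_))                       # R = 3 SVMs
print(ovr.predict(X[:5]))
```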
Binary Decision Tree
At each node of the binary tree, split the remaining classes in half. This creates R-1 SVMs and consults at most ceil(\log_2 R) of them per decision (see the sketch after the pros and cons below).
Pros:
- The number of SVMs to be trained is relatively few.
- When adding classes, only some SVMs need to be trained. The other existing SVMs can be re-used.
- The number of SVMs that need to be evaluated to make a decision is further reduced.
- Training time is slightly faster since not all SVMs use the whole training dataset. Smaller training sets are used as the algorithm descends the decision tree.
Cons:
- When removing classes, all SVMs need to be re-trained.
- The classification performance heavily depends on the SVMs in the upper levels. Classification errors will propagate.
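A minimal binary-decision-tree sketch (assuming scikit-learn and the iris dataset; grouping class 0 against {1, 2} at the root is an illustrative choice):
```python
# Binary decision tree for R = 3 classes: R - 1 = 2 SVMs are trained and at
# most ceil(log2 3) = 2 are consulted per sample.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
root = SVC(kernel="linear").fit(X, (y > 0).astype(int))   # class 0 vs {1, 2}
leaf = SVC(kernel="linear").fit(X[y > 0], y[y > 0])       # 1 vs 2, smaller training set

def predict(x):
    x = np.asarray(x).reshape(1, -1)
    if root.predict(x)[0] == 0:       # root decides the subtree
        return 0
    return int(leaf.predict(x)[0])    # leaf decides within the subtree

print([predict(x) for x in X[:5]], y[:5])   # the first five iris samples are class 0
```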
Binary Code
Each SVM splits the full collection of classes into two groups according to one bit of a binary code. A total of ceil(\log_2 R) SVMs are trained, and all of them are consulted; each class corresponds to a different combination of SVM outputs (see the sketch after the pros and cons below).
Pros:
- The number of SVMs to be trained is further reduced.
- The number of SVMs to be evaluated to make a decision is further reduced.
Cons:
- It may lead to undefined classes if the number of classes isn’t exactly 2^{number of SVMs}.
- Training time may be longer as the whole dataset is used for training each SVM.
- When adding/removing classes, all SVMs have to be re-trained.
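A minimal binary-code sketch (assuming scikit-learn and NumPy; the 4-class toy data and the assignment of one SVM per bit of the class index are illustrative choices):
```python
# Binary code for R = 4 classes using ceil(log2 4) = 2 SVMs, one per bit.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
centres = np.array([[0, 0], [4, 0], [0, 4], [4, 4]])
X = rng.normal(size=(200, 2)) + np.repeat(centres, 50, axis=0)
y = np.repeat(np.arange(4), 50)

# SVM k predicts bit k of the class label's binary code.
svms = [SVC(kernel="linear").fit(X, (y >> k) & 1) for k in range(2)]

def predict(x):
    x = np.asarray(x).reshape(1, -1)
    bits = [int(svm.predict(x)[0]) for svm in svms]   # all SVMs are consulted
    return bits[0] | (bits[1] << 1)                   # decode the bit combination

print(predict([4.1, 3.9]))   # expected: class 3 (binary 11)
```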