Support Vector Machines Flashcards
What is the main idea behind a support vector machine?
An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on the side of the gap on which they fall.
Why do we call an SVM a large margin classifier?
1) An SVM is a type of classifier which separates positive and negative examples, here the blue and red data points.
2) As shown in the image, the largest possible margin is found in order to avoid overfitting, i.e. the optimal hyperplane is at the maximum distance from the positive and negative examples (equidistant from the two boundary lines).
3) To satisfy this constraint while still classifying the data points accurately, the margin is maximised; that is why an SVM is called a large margin classifier.
What do we mean by hard margin classification? What are the two main issues?
When we strictly impose that all instances must be off the margin and on the correct side.
The issues are:
1) It only works if the data is linearly separable.
2) It is sensitive to outliers.
What do we mean by soft margin?
This idea is based on a simple premise: allow the SVM to make a certain number of mistakes while keeping the margin as wide as possible, so that other points can still be classified correctly. This is done simply by modifying the objective of the SVM.
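A minimal scikit-learn sketch of this trade-off (the toy dataset and C values are assumptions, not part of the card): the C hyperparameter controls how many mistakes are tolerated, with a small C giving a wider, softer margin and a large C approaching hard margin behaviour.

from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Toy data: two blobs of points (illustrative assumption)
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

# Small C: more margin violations tolerated, wider margin
soft = LinearSVC(C=0.01, max_iter=10_000).fit(X, y)
# Large C: violations penalised heavily, behaviour approaches a hard margin
hard_ish = LinearSVC(C=1000, max_iter=10_000).fit(X, y)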
What is the main idea of the kernel method?
Kernel methods are a class of algorithms for pattern analysis or recognition. Their main characteristic, however, is their distinct approach to this problem: kernel methods map the data into higher-dimensional spaces in the hope that in this higher-dimensional space the data becomes more easily separable or better structured.
What is the kernel trick?
We can observe from the picture that the equation depends on the dot product of input vector pairs (xi, xj), which is nothing but a kernel function. Now here's the good thing: we don't have to be restricted to a simple kernel function like the dot product. We can use any fancy kernel function in place of the dot product, one that has the capability of measuring similarity in higher dimensions (where it could be more accurate; more on this later), without increasing the computational cost much. This is essentially what is known as the kernel trick.
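As a hedged sketch of how this looks in scikit-learn (the dataset and parameter values are assumptions): SVC accepts either a built-in kernel name or any callable that returns the matrix of pairwise similarities, so swapping the plain dot product for a fancier kernel is a one-line change.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)  # assumed toy data

def dot_product_kernel(A, B):
    # the "simple" kernel: plain dot products between all pairs of inputs
    return A @ B.T

svm_linear = SVC(kernel=dot_product_kernel).fit(X, y)  # custom callable kernel
svm_rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)       # a fancier similarity, same API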
How did we create that circle using a kernel function?
1 - Each point P is represented by (x,y) coordinates in 2D space.
2 - We project the points to 3D space by transforming their coordinates to (x^2, y^2, √2xy)
3 - Points which have a high value of x·y move upwards along the z-axis (in this case, mostly the red circles).
4 - We find a hyperplane in 3D space that would perfectly separate the classes.
5 - The form of the kernel function indicates that this hyperplane would form a circle in 2D space, thus giving us a non-linear decision boundary.
By embedding the data in a higher-dimensional feature space, we can keep using a linear classifier! There is a nice visualization of this exact solution here:
https://www.youtube.com/watch?v=3liCbRZPrZA
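A small sketch of the same construction (the dataset and parameter choices are assumptions): building the 3D features (x^2, y^2, √2xy) by hand and fitting a linear SVM gives the same kind of circular boundary that a degree-2 polynomial kernel finds implicitly.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC, SVC

# Assumed toy data: one class inside a circle, the other outside
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

# Explicit map to 3D: (x, y) -> (x^2, y^2, sqrt(2)*x*y)
Z = np.c_[X[:, 0] ** 2, X[:, 1] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1]]
explicit = LinearSVC(max_iter=10_000).fit(Z, y)  # linear separator in the 3D space

# The kernel trick reaches the same boundary without building Z:
# (x . y)^2 is exactly the kernel induced by the map above
implicit = SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0).fit(X, y)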
What is the goal when training a linear SVM classifier?
Training a linear SVM classifier means finding the values of w (the weight vector or parameters) and b (the bias term) that make the margin as wide as possible while avoiding margin violations entirely (hard margin) or limiting them (soft margin).
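A quick sketch (the toy data is an assumption): after fitting, the learned w and b are exposed as coef_ and intercept_, and the margin width is 2 / ||w||.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=100, centers=2, random_state=42)  # assumed toy data
clf = LinearSVC(C=1.0, max_iter=10_000).fit(X, y)

w = clf.coef_[0]                      # the learned weight vector w
b = clf.intercept_[0]                 # the learned bias term b
margin_width = 2 / np.linalg.norm(w)  # distance between the two margin boundaries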
What is a polynomial kernel?
In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models, that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models.
With so many kernels to choose from, how can you decide which one to use?
As a rule of thumb, you should always try the linear kernel first. If the training set is not too large, you should also try the Gaussian RBF kernel.
Some kernels may be specialized for your training set's data structure. If these first two kernels do not work well, other kernels might be worth a look.
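A hedged sketch of that rule of thumb (the dataset and pipeline choices are assumptions): fit the linear kernel first, then compare it against the Gaussian RBF kernel with cross-validation before reaching for anything more exotic.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # assumed example dataset

for kernel in ("linear", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    print(kernel, cross_val_score(model, X, y, cv=5).mean())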
According to sklearn, what are the advantages and disadvantages of SVMs?
The advantages of support vector machines are:
1) Effective in high dimensional spaces.
2) Still effective in cases where number of dimensions is greater than the number of samples.
3) Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
4) Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
The disadvantages of support vector machines include:
1) If the number of features is much greater than the number of samples, avoiding over-fitting by choosing the kernel function and regularization term carefully is crucial.
2) SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see the Scores and probabilities section of the scikit-learn documentation).
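A short sketch of that last point (the toy data is an assumption): probability estimates are only available when you opt in with probability=True, which triggers the extra cross-validation at fit time.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)  # assumed toy data

# probability=True triggers the extra (expensive) cross-validation at fit time
clf = SVC(kernel="rbf", probability=True).fit(X, y)
print(clf.predict_proba(X[:3]))  # class probabilities, not just labels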
What is the linear kernel formula?
The linear kernel is the simplest kernel function. It is given by the inner product plus an optional constant c.
k(x, y) = x^T y + c
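A one-function NumPy sketch (the default c = 0 is an assumption; it recovers the plain dot product):

import numpy as np

def linear_kernel(x, y, c=0.0):
    # k(x, y) = x^T y + c
    return np.dot(x, y) + c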
What is the polynomial kernel formula?
k(x, y) = (a x^T y + c)^d
The polynomial kernel is a non-stationary kernel. Polynomial kernels are well suited to problems where all the training data is normalized. The adjustable parameters are the slope a, the constant term c and the polynomial degree d.
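A hedged sketch of the formula, with a note on how the parameters map onto scikit-learn's SVC (the default values are assumptions):

import numpy as np
from sklearn.svm import SVC

def polynomial_kernel(x, y, a=1.0, c=1.0, d=3):
    # k(x, y) = (a * x^T y + c)^d
    return (a * np.dot(x, y) + c) ** d

# In SVC the same parameters appear as gamma (a), coef0 (c) and degree (d)
clf = SVC(kernel="poly", gamma=1.0, coef0=1.0, degree=3)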
What is the Gaussian kernel formula?
k(x, y) = exp(-γ ||x - y||^2)
The Gaussian kernel is an example of a radial basis function kernel. The adjustable parameter sigma (related to γ above by γ = 1/(2σ²)) plays a major role in the performance of the kernel and should be carefully tuned to the problem at hand. If it is overestimated, the exponential will behave almost linearly and the higher-dimensional projection will start to lose its non-linear power. On the other hand, if it is underestimated, the function will lack regularization and the decision boundary will be highly sensitive to noise.
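A small NumPy sketch (the default sigma is an assumption), writing γ in terms of σ:

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2), with gamma = 1 / (2 * sigma^2)
    gamma = 1.0 / (2.0 * sigma ** 2)
    return np.exp(-gamma * np.sum((x - y) ** 2))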
For the linear SVM classifier, what is the Decision function?
ŷ = 0 if w^T x + b < 0, and 1 if w^T x + b ≥ 0
Note that ŷ is the predicted value, w is the weight vector (parameter vector) and b is the bias term. If the result w^T x + b is non-negative, the predicted class ŷ is the positive class (1); otherwise it is the negative class (0).
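As a sketch (the toy data is an assumption), the same rule can be checked by hand against scikit-learn's decision_function:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=100, centers=2, random_state=42)  # assumed toy data
clf = LinearSVC(max_iter=10_000).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
scores = X @ w + b                    # w^T x + b for every instance
y_pred = (scores >= 0).astype(int)    # the decision rule from the card

assert np.allclose(scores, clf.decision_function(X))  # matches sklearn's own scores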