Support Vector Machines Flashcards
What is the main idea behind a support vector machine?
An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on the side of the gap on which they fall.
Why do we call an SVM a large margin classifier?
1) An SVM is a type of classifier which separates positive and negative examples, here the blue and red data points.
2) As shown in the image, the largest possible margin is found in order to avoid overfitting, i.e. the optimal hyperplane is at the maximum distance from the positive and negative examples (equidistant from the two boundary lines).
3) To satisfy this constraint while still classifying the data points accurately, the margin is maximised; that is why an SVM is called a large margin classifier.
What do we mean by hard margin classification? What are the two main issues?
When we strictly impose that all instances must be off the margin and on the correct side.
The issues are:
1) It only works if the data is linearly separable.
2) It is sensitive to outliers.
What do we mean by soft margin?
This idea is based on a simple premise: allow the SVM to make a certain number of mistakes while keeping the margin as wide as possible, so that other points can still be classified correctly. This is done simply by modifying the objective of the SVM.
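A minimal scikit-learn sketch of this trade-off (the toy dataset and C values are assumptions, not part of the card): the C hyperparameter controls how many mistakes are tolerated, with a small C giving a wider, softer margin and a large C approaching hard margin behaviour.

from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Toy data: two blobs of points (illustrative assumption)
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

# Small C: more margin violations tolerated, wider margin
soft = LinearSVC(C=0.01, max_iter=10_000).fit(X, y)
# Large C: violations penalised heavily, behaviour approaches a hard margin
hard_ish = LinearSVC(C=1000, max_iter=10_000).fit(X, y)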
What is the main idea of the kernel method?
Kernel methods are a class of algorithms for pattern analysis or recognition. Their main characteristic, however, is their distinct approach to this problem: kernel methods map the data into higher-dimensional spaces in the hope that in this higher-dimensional space the data becomes more easily separable or better structured.
What is the kernel trick?
We can observe from the picture that the equation depends on the dot product of input vector pairs (xi, xj), which is nothing but a kernel function. Now here's the good thing: we don't have to be restricted to a simple kernel function like the dot product. We can use any fancy kernel function in place of the dot product, one that has the capability of measuring similarity in higher dimensions (where it could be more accurate; more on this later), without increasing the computational cost much. This is essentially what is known as the kernel trick.
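As a hedged sketch of how this looks in scikit-learn (the dataset and parameter values are assumptions): SVC accepts either a built-in kernel name or any callable that returns the matrix of pairwise similarities, so swapping the plain dot product for a fancier kernel is a one-line change.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)  # assumed toy data

def dot_product_kernel(A, B):
    # the "simple" kernel: plain dot products between all pairs of inputs
    return A @ B.T

svm_linear = SVC(kernel=dot_product_kernel).fit(X, y)  # custom callable kernel
svm_rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)       # a fancier similarity, same API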
How did we create that circle using a kernel function?
1 - Each point P is represented by (x,y) coordinates in 2D space.
2 - We project the points to 3D space by transforming their coordinates to (x^2, y^2, √2xy)
3 - Points which have a high value of x·y move upwards along the z-axis (in this case, mostly the red circles).
4 - We find a hyperplane in 3D space that would perfectly separate the classes.
5 - The form of the kernel function indicates that this hyperplane would form a circle in 2D space, thus giving us a non-linear decision boundary.
By embedding the data in a higher-dimensional feature space, we can keep using a linear classifier! There is a nice visualization of this exact solution here:
https://www.youtube.com/watch?v=3liCbRZPrZA
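A small sketch of the same construction (the dataset and parameter choices are assumptions): building the 3D features (x^2, y^2, √2xy) by hand and fitting a linear SVM gives the same kind of circular boundary that a degree-2 polynomial kernel finds implicitly.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC, SVC

# Assumed toy data: one class inside a circle, the other outside
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

# Explicit map to 3D: (x, y) -> (x^2, y^2, sqrt(2)*x*y)
Z = np.c_[X[:, 0] ** 2, X[:, 1] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1]]
explicit = LinearSVC(max_iter=10_000).fit(Z, y)  # linear separator in the 3D space

# The kernel trick reaches the same boundary without building Z:
# (x . y)^2 is exactly the kernel induced by the map above
implicit = SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0).fit(X, y)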
What is the goal when training a linear SVM classifier?
Training a linear SVM classifier means finding the values of w (the weight vector or parameters) and b (the bias term) that make the margin as wide as possible while avoiding margin violations entirely (hard margin) or limiting them (soft margin).
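A quick sketch (the toy data is an assumption): after fitting, the learned w and b are exposed as coef_ and intercept_, and the margin width is 2 / ||w||.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=100, centers=2, random_state=42)  # assumed toy data
clf = LinearSVC(C=1.0, max_iter=10_000).fit(X, y)

w = clf.coef_[0]                      # the learned weight vector w
b = clf.intercept_[0]                 # the learned bias term b
margin_width = 2 / np.linalg.norm(w)  # distance between the two margin boundaries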
What is a polynomial kernel?
In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models, that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models.
With so many kernels to choose from, how can you decide which one to use?
As a rule of thumb, you should always try the linear kernel first. If the training set is not too large, you should also try the Gaussian RBF kernel.
Some kernels may be specialized for your training set's data structure. If these first two kernels do not work well, other kernels might be worth a look.
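A hedged sketch of that rule of thumb (the dataset and pipeline choices are assumptions): fit the linear kernel first, then compare it against the Gaussian RBF kernel with cross-validation before reaching for anything more exotic.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # assumed example dataset

for kernel in ("linear", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    print(kernel, cross_val_score(model, X, y, cv=5).mean())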
According to sklearn, what are the advantages and disadvantages of SVMs?
The advantages of support vector machines are:
1) Effective in high dimensional spaces.
2) Still effective in cases where number of dimensions is greater than the number of samples.
3) Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
4) Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
The disadvantages of support vector machines include:
1) If the number of features is much greater than the number of samples, avoiding over-fitting by choosing the kernel function and regularization term carefully is crucial.
2) SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see the Scores and probabilities section of the scikit-learn documentation).
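A short sketch of that last point (the toy data is an assumption): probability estimates are only available when you opt in with probability=True, which triggers the extra cross-validation at fit time.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)  # assumed toy data

# probability=True triggers the extra (expensive) cross-validation at fit time
clf = SVC(kernel="rbf", probability=True).fit(X, y)
print(clf.predict_proba(X[:3]))  # class probabilities, not just labels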
What is the linear kernel formula?
The linear kernel is the simplest kernel function. It is given by the inner product plus an optional constant c.
k(x, y) = x^T y + c
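A one-function NumPy sketch (the default c = 0 is an assumption; it recovers the plain dot product):

import numpy as np

def linear_kernel(x, y, c=0.0):
    # k(x, y) = x^T y + c
    return np.dot(x, y) + c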
What is the polynomial kernel formula?
k(x, y) = (a x^T y + c)^d
The polynomial kernel is a non-stationary kernel. Polynomial kernels are well suited to problems where all the training data is normalized. The adjustable parameters are the slope a, the constant term c and the polynomial degree d.
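A hedged sketch of the formula, with a note on how the parameters map onto scikit-learn's SVC (the default values are assumptions):

import numpy as np
from sklearn.svm import SVC

def polynomial_kernel(x, y, a=1.0, c=1.0, d=3):
    # k(x, y) = (a * x^T y + c)^d
    return (a * np.dot(x, y) + c) ** d

# In SVC the same parameters appear as gamma (a), coef0 (c) and degree (d)
clf = SVC(kernel="poly", gamma=1.0, coef0=1.0, degree=3)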
What is the Gaussian kernel formula?
k(x, y) = exp(-γ ||x - y||^2)
The Gaussian kernel is an example of a radial basis function kernel. The adjustable parameter sigma (related to γ above by γ = 1/(2σ²)) plays a major role in the performance of the kernel and should be carefully tuned to the problem at hand. If it is overestimated, the exponential will behave almost linearly and the higher-dimensional projection will start to lose its non-linear power. On the other hand, if it is underestimated, the function will lack regularization and the decision boundary will be highly sensitive to noise.
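A small NumPy sketch (the default sigma is an assumption), writing γ in terms of σ:

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2), with gamma = 1 / (2 * sigma^2)
    gamma = 1.0 / (2.0 * sigma ** 2)
    return np.exp(-gamma * np.sum((x - y) ** 2))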
For the linear SVM classifier, what is the Decision function?
ŷ = 0 if w^T x + b < 0, and 1 if w^T x + b ≥ 0
Note that ŷ is the predicted value, w is the weight vector (parameter vector) and b is the bias term. If the result w^T x + b is non-negative, the predicted class ŷ is the positive class (1); otherwise it is the negative class (0).
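As a sketch (the toy data is an assumption), the same rule can be checked by hand against scikit-learn's decision_function:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=100, centers=2, random_state=42)  # assumed toy data
clf = LinearSVC(max_iter=10_000).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
scores = X @ w + b                    # w^T x + b for every instance
y_pred = (scores >= 0).astype(int)    # the decision rule from the card

assert np.allclose(scores, clf.decision_function(X))  # matches sklearn's own scores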