Classification: linear models III - support vector machines Flashcards
What is a maximum margin hyperplane?
See figure!
It is the separating hyperplane with the largest possible margin, i.e. the largest distance from the hyperplane to the nearest training points.
Why SVM then?
It is a computationally smart way to find the maximum margin hyperplane.
Instead of maximizing the margin 1/||w|| directly,
we can equivalently minimize ½||w||², which is much easier computation-wise: it is a convex quadratic objective with linear constraints, so standard quadratic programming solvers apply.
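In equation form (a standard formulation, consistent with the margin 1/||w|| above), the primal problem is:

```latex
\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ \ge\ 1, \qquad i = 1, \dots, n
```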
What technique does the SVM use to set up this so-called dual optimization problem?
It uses Lagrange multipliers
What does the solution of the dual optimization problem give us?
The solution of the dual problem gives the solution of the original (primal) problem, i.e. it yields the optimal separating hyperplane.
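Concretely, in the standard formulation with Lagrange multipliers αᵢ (one per training point), the dual problem is:

```latex
\max_{\alpha} \ \sum_{i} \alpha_i \;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \,(x_i \cdot x_j)
\quad \text{s.t.} \quad \alpha_i \ge 0, \quad \sum_{i} \alpha_i y_i = 0
```

The optimal weights are recovered as w = Σᵢ αᵢ yᵢ xᵢ; the points with αᵢ > 0 are the support vectors. Note that the data enters only through dot products xᵢ · xⱼ, which is exactly what the kernel trick below exploits.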
How can we handle non-linearly separable data in an SVM?
Use a data transformation φ to map the data into another feature space.
Problem: suitable feature spaces may need to be very high-dimensional. That makes computations with transformed vectors φ(x) very expensive.
How do we handle the expensive computations of mapping to a higher-dimensional feature space?
We use kernels, more precisely the kernel trick, to avoid having to work in the feature space explicitly; we only need a measure of similarity, which is much less costly!
We can replace the dot products in the SVM with a kernel function (this method is called the kernel trick!)
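A minimal sketch of this in Python (illustrative names and values, not any particular library's API):

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    # k(x, y) = exp(-gamma * ||x - y||^2): a similarity score computed
    # directly, without ever building the feature vectors phi(x).
    return np.exp(-gamma * np.sum((x - y) ** 2))

def decision_function(x, support_vectors, alphas, labels, b):
    # Dual form of the SVM decision function:
    #   f(x) = sum_i alpha_i * y_i * k(x_i, x) + b
    # The dot product x_i . x has simply been replaced by k(x_i, x).
    return sum(a * y * rbf_kernel(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b
```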
Why kernels, as opposed to feature vectors?
One big reason is that in many cases, computing the kernel is easy, but computing the feature vector corresponding to the kernel is really really hard.
The feature vector for even simple kernels can blow up in size, and for kernels like the RBF kernel, k(x, y) = exp(−‖x − y‖²) (see radial basis function kernel), the corresponding feature vector is infinite-dimensional. Yet computing the kernel is almost trivial.
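A concrete illustration, using the degree-2 homogeneous polynomial kernel k(x, y) = (x · y)² as an example: the kernel is one dot product, while the explicit feature map already has d² entries (dᵖ for degree p).

```python
import numpy as np
from itertools import product

def poly2_kernel(x, y):
    # One dot product, then a square: O(d) work.
    return np.dot(x, y) ** 2

def poly2_features(x):
    # The explicit feature map for the same kernel: all d^2 pairwise
    # products x_i * x_j. For degree p this would be d^p entries.
    return np.array([xi * xj for xi, xj in product(x, x)])

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
print(poly2_kernel(x, y))                             # 20.25
print(np.dot(poly2_features(x), poly2_features(y)))   # 20.25, same value
```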
How do we know if we can remove a point from the set of support vectors?
If the maximum margin hyperplane does not change when this vector is removed from the set, we do not have to keep it, and it is therefore not a support vector.
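This is easy to check in practice; a quick sketch with scikit-learn (toy data of my own choosing):

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters; only the points nearest the boundary matter.
X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C: (near-)hard margin
print(clf.support_vectors_)
# Refitting without any point NOT listed here leaves the hyperplane unchanged.
```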
What are the pros and cons of SVMs and kernels?
See figure
How can we construct kernels?
Using Mercer's theorem.
This gives us what we can call Mercer kernels.
See figure for definition!
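In its usual statement, a symmetric function k is a valid (Mercer) kernel when every kernel matrix it produces is positive semi-definite:

```latex
\sum_{i=1}^{n} \sum_{j=1}^{n} c_i \, c_j \, k(x_i, x_j) \ \ge\ 0
\qquad \text{for all } x_1, \dots, x_n \text{ and all } c_1, \dots, c_n \in \mathbb{R}
```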
What are some often used kernels?
The polynomial kernel (besides the RBF kernel mentioned above).
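The polynomial kernel's standard form, with degree d and an offset c ≥ 0 that weights lower-order terms:

```latex
k(x, y) = (x \cdot y + c)^d
```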
What are some kernels we can make for string data?
(structured data)
String features:
φu(s) := #occurrences of u as a substring of s
φ⁺u(s) := #occurrences of u as a subsequence of s
https://gyazo.com/1db9464264948385c2dc8e6e2d8640d2
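A brute-force sketch of both feature types (helper names are my own; just to pin down the two definitions):

```python
from itertools import combinations

def count_substring(u, s):
    # phi_u(s): occurrences of u as a contiguous substring of s.
    return sum(1 for i in range(len(s) - len(u) + 1) if s[i:i + len(u)] == u)

def count_subsequence(u, s):
    # phi+_u(s): occurrences of u as a (possibly gapped) subsequence of s.
    return sum(1 for idx in combinations(range(len(s)), len(u))
               if "".join(s[i] for i in idx) == u)

print(count_substring("ab", "abab"))    # 2
print(count_subsequence("ab", "abab"))  # 3: index pairs (0,1), (0,3), (2,3)
```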
p-Spectrum Kernel
Compares two strings by counting their common substrings of length p.
https://gyazo.com/3ba1c3f20eda4612f00789312a3d6b85
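A direct implementation sketch (assuming the standard definition: the dot product of length-p substring counts):

```python
from collections import Counter

def p_spectrum_kernel(s, t, p):
    # Count every length-p substring of each string, then take the dot
    # product of the two count vectors; only shared p-grams contribute.
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[u] * ct[u] for u in cs.keys() & ct.keys())

print(p_spectrum_kernel("statistics", "computation", 3))  # 2: "tat" and "ati"
```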
All-subsequences kernel
The all-subsequences kernel builds on the idea of the p-spectrum kernel, but allows gaps within the matched strings.
For the all-subsequences kernel we often cannot compute the whole feature vector. What can we do?
Use dynamic programming, evaluating the kernel through a recursion (see the sketch below).
See figure
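A sketch of that dynamic program, using the standard recursion from the kernel-methods literature: K(sa, t) = K(s, t) + Σ over positions j with t[j] = a of K(s, t[:j]), with base case K(s, "") = K("", t) = 1 (the empty subsequence always matches).

```python
def all_subsequences_kernel(s, t):
    # K[i][j] = all-subsequences kernel of the prefixes s[:i] and t[:j].
    K = [[1] * (len(t) + 1)] + [[0] * (len(t) + 1) for _ in s]
    for i in range(1, len(s) + 1):
        a = s[i - 1]
        for j in range(len(t) + 1):
            # Either the new character a is left unused, or it pairs with
            # some earlier occurrence of a in t[:j].
            K[i][j] = K[i - 1][j] + sum(K[i - 1][k]
                                        for k in range(j) if t[k] == a)
    return K[len(s)][len(t)]

print(all_subsequences_kernel("aa", "a"))  # 3 = 1 (empty) + 2*1 (for "a")
```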