Chapter 9 - Non-linear Boundary, Kernel, SVM Flashcards
How do we use the SVC with non-linear boundaries?
- similar to logistic regression, we can transform the predictors (e.g. add a quadratic predictor)
- expand the set of predictors to a larger space.
e.g. (X1, X2) becomes (X1, X2, X1^2); the fitted function gains a Beta3*X1^2 term, so the boundary is quadratic in X1
Idea: add polynomial terms up to degree d
doesn't make the computation more expensive, because all the algorithm needs are the dot products between pairs of samples (a sketch of the idea follows below)
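a minimal sketch of this idea, assuming scikit-learn (not named in these notes): expand the predictors with polynomial terms, then fit an ordinary linear support vector classifier on the expanded set.

```python
# Minimal sketch (assumes scikit-learn): expand the predictors with polynomial
# terms up to degree d, then fit a plain linear support vector classifier.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # original predictors (X1, X2)
y = (X[:, 0] ** 2 + X[:, 1] > 0).astype(int)  # class depends quadratically on X1

# degree=2 adds X1^2, X2^2 and X1*X2, so a linear boundary in the expanded
# space is quadratic in the original predictors
clf = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                    LinearSVC(C=1.0, max_iter=10_000))
clf.fit(X, y)
print(clf.score(X, y))
```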
Kernels
a kernel is a function K(·,·) for which the matrix with entries K(i,j) = K(zi, zj) is positive definite (symmetric with positive eigenvalues) for any set z1, …, zk of linearly independent vectors
The kernel trick
instead of defining a mapping that expands the set of predictors and then computing the dot product, use the kernel trick.
1) prove that a candidate function is positive definite
2) then use that function to compute the kernel (Gram) matrix K directly, without ever forming the expanded predictors (see the numerical check below)
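a small numerical check of the trick, using the degree-2 polynomial kernel K(x, z) = (1 + x·z)^2 on two predictors: the kernel value equals the dot product of the explicitly expanded predictors, so the expansion never has to be computed.

```python
# Kernel trick check for the degree-2 polynomial kernel K(x, z) = (1 + <x, z>)^2:
# evaluating K directly gives the same number as expanding with phi and taking
# a dot product in the 6-dimensional expanded space.
import numpy as np

def phi(x):
    # explicit degree-2 expansion of x = (x1, x2)
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    return (1.0 + np.dot(x, z)) ** 2

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])
print(np.dot(phi(x), phi(z)))  # explicit expansion: 2.89
print(poly_kernel(x, z))       # kernel trick, no expansion: 2.89
```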
Kernel specifics (3)
proving that a kernel is PD is not always easy. However, we can use the rule that sums and products of PD kernels are PD. Intuitively, a kernel defines the similarity between two observations.
common kernels: the polynomial kernel and the radial basis kernel (standard forms below)
kernels are also great at dealing with non-standard data types (strings, graphs, trees, images). It is easier to define a similarity measure which is PD than a set of features that capture the information in each sample
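for reference, the standard forms of these two kernels (d is the polynomial degree, gamma > 0 is a tuning parameter):

```latex
% polynomial kernel of degree d
K(x_i, x_{i'}) = \Bigl(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\Bigr)^{d}
% radial basis kernel
K(x_i, x_{i'}) = \exp\Bigl(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2}\Bigr)
```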
Support Vector Machines
a support vector classifier applied to an expanded set of predictors.
- we expand the set of predictors for each sample x_i then perform the algorithm
- we again only need to know the dot products for every pair of samples (a kernel-based sketch follows below)
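a minimal sketch, again assuming scikit-learn: the same kind of classifier, but the expansion is implicit because we only hand the algorithm a kernel.

```python
# Minimal sketch (assumes scikit-learn): fit SVMs with a polynomial and a
# radial basis kernel; the expanded predictors are never formed explicitly.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # circular class boundary

# gamma=1, coef0=1 makes sklearn's poly kernel equal to (1 + <x, z>)^3
poly_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print(poly_svm.score(X, y), rbf_svm.score(X, y))
```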
The Kernel trick and SVM
often the dot product is a simple function of the original vectors (even if the mapping significantly expands the space of predictors). In some cases, you can use the kernel trick even if you are expanding into an infinite number of transformations (radial kernel)
- this works because we can check that K is a valid (PD) measure of similarity without ever writing down the transformation; the resulting decision function is shown below
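the fitted SVM's decision function can be written entirely in terms of the kernel (S is the set of support vectors and the alpha_i are fitted coefficients):

```latex
f(x) = \beta_0 + \sum_{i \in S} \alpha_i \, K(x, x_i)
```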
Kernel trick beyond SVMs
Kernel PCA - suppose we wanted to do PCA on an expanded set of predictors phi(x_i)
- to make a biplot, all we need to know is the kernel (Gram) matrix
- even if phi maps the p predictors into a very high-dimensional space, we can still do PCA
- the cost is only a function of n (sketch below)
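a sketch of kernel PCA from the Gram matrix alone, assuming a radial basis kernel with gamma = 1; every object involved is n x n, so the cost does not depend on the dimension phi maps into (scikit-learn's KernelPCA packages the same computation).

```python
# Kernel PCA sketch: only the n x n Gram matrix is needed, never phi(x_i) itself.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
n = X.shape[0]

# Gram matrix K_ij = K(x_i, x_j) with a radial basis kernel, gamma = 1
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-1.0 * sq_dists)

# double-center the Gram matrix (equivalent to centering the expanded predictors)
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J

# eigendecomposition of the centered Gram matrix gives the principal component
# scores in the expanded space: score_k = sqrt(lambda_k) * v_k
eigvals, eigvecs = np.linalg.eigh(Kc)
order = np.argsort(eigvals)[::-1]
scores = eigvecs[:, order[:2]] * np.sqrt(np.maximum(eigvals[order[:2]], 0.0))
print(scores.shape)  # (n, 2): enough for a 2-D score plot / biplot
```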
extending the SVM to more than 2 classes (2 approaches)
SVMs don’t generalize nicely to cases of more than 2 classes.
1 v 1 - fit K choose 2 SVMs, one for every pair of classes (K = number of classes). assign a sample i to the class that wins the most 1 v 1 challenges
1 v all - construct a classifier beta(k) for every class k against all the others. assign a sample i to the class k for which the distance from x_i to the hyperplane defined by beta(k) is largest (the distance is negative if the sample is misclassified). a sketch of both approaches follows below
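a sketch of both strategies, assuming scikit-learn: SVC fits the K choose 2 one-vs-one classifiers itself on multiclass data, and OneVsRestClassifier wraps a binary SVC as one-vs-all.

```python
# Multiclass SVM sketch (assumes scikit-learn) on three well-separated classes.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = rng.normal(size=(300, 2)) + np.repeat(centers, 100, axis=0)
y = np.repeat([0, 1, 2], 100)

# SVC trains one binary SVM per pair of classes (one vs one)
ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
# OneVsRestClassifier trains one binary SVM per class against the rest (one vs all)
ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
print(ovo.score(X, y), ova.score(X, y))
```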
relationship to logistic regression
we can formulate the method for finding an SVM as a Loss + Penalty optimization. for each sample we incur the hinge loss max[0, 1 - y_i f(x_i)]. in logistic regression we minimize a similar loss, sum_i log[1 + exp(-y_i f(x_i))] (side-by-side forms below)
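the two loss + penalty problems side by side, assuming f(x) = beta_0 + beta_1 x_1 + … + beta_p x_p and a ridge-type penalty with tuning parameter lambda:

```latex
% SVM: hinge loss + ridge penalty
\min_{\beta_0, \ldots, \beta_p} \; \sum_{i=1}^{n} \max\bigl[0,\; 1 - y_i f(x_i)\bigr] + \lambda \sum_{j=1}^{p} \beta_j^2
% logistic regression: log (binomial deviance) loss + ridge penalty
\min_{\beta_0, \ldots, \beta_p} \; \sum_{i=1}^{n} \log\bigl(1 + e^{-y_i f(x_i)}\bigr) + \lambda \sum_{j=1}^{p} \beta_j^2
```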