Kernelization Flashcards
Explain feature maps, and why they are not always a good solution.
By applying a feature map we add non-linear transformations of the features to a linear model. For example, we can add powers of certain features as extra features, e.g. add x1^2 as one term. This provides a way of fitting different curve shapes to the problem.
Adding new features comes at a memory and computational cost: we need to learn more weights, and the risk of overfitting becomes larger. For example, when adding more features to Ridge Regression, the closed-form solution becomes much more expensive, since it requires forming and inverting a p × p matrix (roughly O(np^2 + p^3) for p features and n samples).
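A minimal sketch of an explicit feature map, assuming scikit-learn: PolynomialFeatures acts as a degree-2 feature map in front of Ridge, and the printed shape shows how quickly the number of features (and hence weights) grows.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 5 original features
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=100)  # non-linear target

# Degree-2 feature map: adds squares and pairwise products as extra features.
poly = PolynomialFeatures(degree=2, include_bias=False)
model = make_pipeline(poly, Ridge(alpha=1.0))
model.fit(X, y)

# The explicit map blows up the feature count (5 -> 20 here),
# so memory use and the cost of the closed-form solve grow with it.
print(poly.fit_transform(X).shape)       # (100, 20)
```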
Explain kernel functions in relation to feature maps.
A feature map is a function that takes feature vectors in one space and transforms them into feature vectors in another. A kernel function is a function that computes the inner product of two feature vectors in a transformed space. Kernel functions can be used to implicitly map data into higher-dimensional spaces without explicitly computing the feature maps.
The kernel function computes these inner products between points cheaply, without explicitly constructing the high-dimensional space at all.
The dot product is a measure of similarity between two vectors; hence, a kernel can be seen as a similarity measure for high-dimensional spaces.
A loss function can be kernelized if it can be written in terms of dot products of the data points: each dot product xi ⋅ xj is simply replaced by k(xi, xj).
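A small numeric check of the kernel trick with NumPy. The helper phi below is a hypothetical explicit feature map for 2-d inputs whose inner product equals the degree-2 polynomial kernel k(x, z) = (x ⋅ z)^2; the kernel gives the same number without ever building phi.

```python
import numpy as np

def phi(x):
    """Hypothetical explicit degree-2 feature map for a 2-d input."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k_poly2(x, z):
    """Degree-2 polynomial kernel: the same inner product, computed in input space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(np.dot(phi(x), phi(z)))  # 16.0 -- inner product in the transformed space
print(k_poly2(x, z))           # 16.0 -- identical, without constructing phi at all
```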
Explain the Polynomial kernel. Provide the most important hyperparameter.
The polynomial kernel reproduces the polynomial feature maps:
kpoly(x1, x2) = (γ(x1 ⋅ x2) + c0)^d
- d is the degree of the polynomial (and thus of the reproduced feature map).
- γ (gamma) is a scaling parameter (default 1/p, with p the number of features).
- c0 is a hyperparameter (default 1) that trades off the influence of the higher-order versus lower-order terms.
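A quick check of this formula against scikit-learn's pairwise polynomial_kernel (assuming scikit-learn is available); the γ, c0 and d values are arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

X = np.array([[1.0, 2.0],
              [0.5, -1.0]])

gamma, c0, d = 0.5, 1.0, 3
# Library version of k_poly(x1, x2) = (gamma * (x1 . x2) + c0)^d
K_lib = polynomial_kernel(X, X, degree=d, gamma=gamma, coef0=c0)
# The same formula written out by hand
K_manual = (gamma * (X @ X.T) + c0) ** d

print(np.allclose(K_lib, K_manual))  # True
```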
Explain the Radial Basis Function Kernel. Provide the most important hyperparameters. How can these hyperparameters lead to overfitting and underfitting?
The RBF feature map builds on the Taylor series expansion of e^x, which corresponds to an infinite-dimensional feature map.
The RBF Kernel looks like:
kRBF(x1, x2) = exp(-γ ||x1 - x2||^2)
- The RBF kernel does not use a dot product; it only considers the distance between x1 and x2.
- It’s a local kernel: every data point only influences data points nearby, whereas linear and polynomial kernels are global: every point affects the whole space.
- Similarity depends on the closeness of the points and the kernel width (see the numeric check below).
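A numeric check of the formula and of the locality claim, assuming scikit-learn's pairwise rbf_kernel; the points and γ are arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

gamma = 0.5
x1 = np.array([[0.0, 0.0]])
x2 = np.array([[1.0, 2.0]])

# k_RBF(x1, x2) = exp(-gamma * ||x1 - x2||^2)
manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))
print(rbf_kernel(x1, x2, gamma=gamma)[0, 0], manual)   # both ~0.082

# Locality: similarity is 1 for identical points and decays towards 0 with distance.
far = np.array([[10.0, 10.0]])
print(rbf_kernel(x1, x1, gamma=gamma)[0, 0])           # 1.0
print(rbf_kernel(x1, far, gamma=gamma)[0, 0])          # ~0, far-away points barely interact
```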
The most important hyperparameters are γ (gamma) and C (cost of margin violations).
- γ: kernel width. High values cause narrow Gaussians, which require more support vectors and lead to overfitting. Low values cause wide Gaussians, which leads to underfitting (see the sketch below).
- C: cost of margin violations. High values punish margin violations heavily, causing narrow margins and overfitting. Low values allow wider margins with more support vectors, leading to underfitting.
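A small illustration of the γ effect, assuming scikit-learn and a toy two-moons dataset: the loop prints the number of support vectors and the training accuracy for a few γ values.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy dataset, purely for illustration.
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

for gamma in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    # Very high gamma (narrow Gaussians) tends to memorize the training set with
    # many support vectors (overfitting); very low gamma (wide Gaussians) gives
    # an over-smooth boundary (underfitting).
    print(f"gamma={gamma:>6}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy = {clf.score(X, y):.2f}")
```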
In practice, what are some general guidelines to get good results with kernels in SVMs?
- C and gamma always need to be tuned: find a good C, then fine-tune gamma.
- SVMs expect all features to be approximately on the same scale, so the data needs to be scaled beforehand (see the sketch below).
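A minimal sketch of both guidelines, assuming scikit-learn: scaling happens inside a Pipeline (so the scaler only sees the training folds) and C and gamma are tuned with a grid search; the dataset and grid values are just illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scale inside the pipeline so scaling is learned on the training folds only.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Illustrative grid: both C and gamma are tuned on a logarithmic scale.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```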
When would an SVM be a suitable choice as a model?
SVMs work well on both low- and high-dimensional data, but they are especially good on small, high-dimensional datasets.