Support Vector Machines Flashcards
What parameters can you tune for SVM?
The hyperparameters that you can commonly tune for SVM are (sketched in code after this list):
- Regularization/cost parameter
- Kernel
- Degree of polynomial (when using a poly kernel)
- Gamma (controls how far the influence of a single training point reaches for the Gaussian radial basis function (RBF) kernel: large gamma means only nearby points shape the boundary)
- Coef0 (controls the influence of high-degree vs. low-degree terms for poly or sigmoid kernels)
- Epsilon (the width of the error-tolerant “tube” used in SVM regression; training errors inside the tube are ignored)
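A minimal scikit-learn sketch of where these knobs live (assuming sklearn; the values shown are illustrative, not recommendations):

```python
from sklearn.svm import SVC, SVR

# Classification: cost, kernel, degree, gamma, coef0
clf = SVC(
    C=1.0,          # regularization/cost: smaller C = wider, softer margin
    kernel="poly",  # "linear", "poly", "rbf", or "sigmoid"
    degree=3,       # polynomial degree (used only by the poly kernel)
    gamma="scale",  # reach of each point's influence (rbf/poly/sigmoid)
    coef0=1.0,      # high- vs. low-degree term weight (poly/sigmoid)
)

# Regression adds the epsilon-insensitive tube
reg = SVR(kernel="rbf", C=1.0, epsilon=0.1)
```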
What are some possible uses of SVM models (e.g. classification, regression, etc.)?
SVM can be used for (see the sketch after this list):
- linear classification
- nonlinear classification
- linear regression
- nonlinear regression
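As a rough map to scikit-learn estimators (a sketch, assuming sklearn; other kernels also work for the nonlinear cases):

```python
from sklearn.svm import LinearSVC, LinearSVR, SVC, SVR

linear_clf = LinearSVC()           # linear classification
nonlinear_clf = SVC(kernel="rbf")  # nonlinear classification
linear_reg = LinearSVR()           # linear regression
nonlinear_reg = SVR(kernel="rbf")  # nonlinear regression
```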
What common kernels can you use for SVM?
- Linear
- Polynomial
- Gaussian RBF (radial basis)
- Sigmoid
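Each kernel, sketched as a plain function (these forms follow the scikit-learn conventions, with gamma, r = coef0, and d = degree from the hyperparameter card above):

```python
import numpy as np

def linear(x, y):
    return x @ y

def poly(x, y, gamma, r, d):
    return (gamma * (x @ y) + r) ** d

def rbf(x, y, gamma):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sigmoid(x, y, gamma, r):
    return np.tanh(gamma * (x @ y) + r)
```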
Why is it important to scale features before using SVM?
SVM tries to fit the widest gap between the classes, so with unscaled features, a feature with a large numeric range dominates the distance calculations and therefore dominates where the split is placed.
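A common guard against this, sketched as a scikit-learn pipeline (assumes sklearn; StandardScaler is one reasonable choice among several scalers):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scaling is fitted inside the pipeline, so cross-validation folds
# are scaled correctly without leaking test data into the scaler.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
```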
Can SVM produce a probability score along with its classification output?
Not directly. An SVM outputs a signed distance to the decision boundary rather than a probability; probability estimates require an extra calibration step (e.g. Platt scaling) fitted on top of the SVM’s scores.
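A sketch of the distinction in scikit-learn (probability=True fits the extra Platt-scaling model with internal cross validation, which makes training slower):

```python
from sklearn.svm import SVC

svc = SVC()                    # .decision_function(X) -> signed margin distances
svc_p = SVC(probability=True)  # .predict_proba(X) -> calibrated probabilities
```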
What does the Support Vector Classifier do to separate classes?
If the data is linearly separable by a hyperplane (dimension p-1, where p is the number of features), then there are infinitely many such hyperplanes.
To obtain one clf among the infinitely many, we can choose the maximal margin hyperplane: the hyperplane whose perpendicular distance to the closest training obs is greatest.
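A sketch of approximating this in scikit-learn: on separable data, a very large C makes the soft margin behave like a hard (maximal) margin (the data and the C value are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])  # 1d, separable
y = np.array([0, 0, 0, 1, 1, 1])

hard = SVC(kernel="linear", C=1e6).fit(X, y)
print(hard.support_vectors_)  # the edge obs (3 and 7) that define the margin
```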
Explain the improvement in classification separation that max margin clfs make.
Say we have obese “x”, not obese “o” binary labels with a decision threshold, “|”:
x x x x x | o o o o o
If we get a new obs “z” near the decision boundary:
x x x x x | z o o o o o
The issue is that the clf will assign “z” to class “o”, even though it is closer in Euclidean distance to class “x”.
A max margin clf (HARD MARGIN THRESHOLD) focuses on the observations “s1”, “s2” at the EDGES of each class cluster and USES THE MIDPOINT BETWEEN THESE EXTREME observations as the decision THRESHOLD:
x x x x s1----|----s2 o o o o
Now, when a new obs falls on the left side of the threshold, it is both closer to and classified with the obese cluster labeled “x”.
Placing the threshold at this midpoint, rather than hugging one cluster’s edge, makes the clf less sensitive to exactly where new obs fall, so it generalizes better.
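A tiny pure-Python sketch of the midpoint rule on the example above (the positions are made up for illustration):

```python
obese = [1.0, 1.5, 2.0, 2.5, 3.0]      # "x" cluster
not_obese = [7.0, 7.5, 8.0, 8.5, 9.0]  # "o" cluster

s1, s2 = max(obese), min(not_obese)    # edge obs of each cluster
threshold = (s1 + s2) / 2              # hard margin midpoint -> 5.0

z = 4.0                                # new obs near the boundary
label = "x" if z < threshold else "o"  # assigned to the closer cluster -> "x"
```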
BAM!
Explain the effects of the maximal margin classifier w.r.t. distances.
Max margin clfs (HARD MARGIN THRESHOLD) assign the midpoint between support vectors of two classes as the DECISION THRESHOLD.
The SHORTEST DISTANCE between the observations and the decision threshold is called the MARGIN.
Because max margin clfs assign the midpoint between s1 and s2 as the threshold, the distances from s1 and from s2 to the threshold are the SAME.
When the threshold is HALF-WAY between the separate classes, the margin is AS LARGE AS IT CAN BE.
Say we move the max margin midpoint a little to the right: the left-side margin increases but the right-side margin decreases, so the smaller of the two (the margin) shrinks. The midpoint is the placement that maximizes it.
A problem arises, however, when a label outlier occurs, e.g. an “x” obs that falls close to the “o” cluster:
x x x x      x o o o
The HARD MARGIN clf will assign the midpoint decision threshold between the outlier “x” and the nearest “o”:
x x x x      x | o o o
This reduces BIAS in training but increases VARIANCE in validation (e.g. a new “o” falling just left of the threshold is misclassified as “x”), so it is unlikely to generalize to new data.
This is a downside to strict max margin clfs.
How does an SVC improve over a strict max margin clf?
An SVC does NOT follow the rule of a HARD MARGIN. Instead, it ALLOWS some obs to VIOLATE the margin (a SOFT MARGIN) and uses cross validation to decide how many violations to tolerate, picking the threshold with the best bias/variance outcome on the validation set. In this way an SVC avoids anchoring the threshold on outliers: it accepts more bias (a worse fit in training) in exchange for lower variance (less overfitting in validation). E.g. instead of the hard margin threshold
x x x x      x | o o o
the SVC will allow the outlier “x” to be misclassified for improved generalizability:
x x x x |      x o o o
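In scikit-learn, how much margin violation the SVC tolerates is governed by C, typically tuned by cross validation exactly as described above (a sketch; the grid values are illustrative assumptions):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small C -> soft margin (more violations allowed); large C -> nearly hard margin
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
# search.fit(X, y); search.best_params_["C"] gives the chosen tolerance
```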
Explain how SVMs generally work
Say we have drug efficacy observations with labels “x”, “o”, dosage on the x-axis, and room for a second dimension on the y-axis:
x axis: o o o o o x x x x o o o o
The data indicate that low or high dosages are ineffective but a moderate dosage is effective. How can we separate this data?
SVM algo:
- start with the data in a low dimension (here, 1d dosage values on the x-axis)
- move the data into 2 dimensions by adding a y-axis (e.g. let y = x^2)
- find the separating hyperplane (a line in 2d) that SEPARATES the transformed 2d data into two classes
If we instead specify the polynomial kernel hyperparameter d=3, the data is treated as if transformed into 3d space, and the SVM finds a separating plane through the classes (see the sketch below).
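A sketch of the 1d -> 2d trick with an explicit y = x^2 feature (the dosage values and labels are made up to match the picture above; assumes numpy and sklearn):

```python
import numpy as np
from sklearn.svm import SVC

dosage = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)
label = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0])  # only moderate doses work

# Not separable in 1d, so add a second dimension: y = x^2.
# The points now lie on a parabola, and a straight (sloped) line
# can split the middle band from the two ends.
X2 = np.column_stack([dosage, dosage ** 2])
clf = SVC(kernel="linear", C=1000).fit(X2, label)
```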
Provide a high level description of how SVM kernels work.
Kernel functions only compute the RELATIONSHIPS between every pair of points AS IF the points were IN A HIGHER DIMENSION; they DON’T ACTUALLY DO THE TRANSFORMATION.
This is called THE KERNEL TRICK: it REDUCES THE AMOUNT OF COMPUTATION in SVMs by avoiding the math that would actually transform the data from low to higher dims.
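A numeric check of the trick for a degree-2 polynomial kernel on 1d points: the kernel value equals an ordinary dot product in an explicit higher-dimensional space, without ever building that space (a sketch; phi is the standard feature map for this particular kernel):

```python
import numpy as np

def poly_kernel(a, b):
    # (a*b + 1)^2, computed entirely in 1d
    return (a * b + 1) ** 2

def phi(x):
    # explicit 3d feature map whose dot products match the kernel
    return np.array([x ** 2, np.sqrt(2) * x, 1.0])

a, b = 2.0, 3.0
assert np.isclose(poly_kernel(a, b), phi(a) @ phi(b))  # both equal 49.0
```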
TRIPLE BAM!