Support Vector Machines Flashcards
- In the overall classification syllabus, where do Support Vector Machines (SVMs) fit, and how do they differ conceptually from the other methods?
The syllabus covers four classification algorithms: two probabilistic (the Bayes classifier and logistic regression) and two non-probabilistic (K-Nearest Neighbors and Support Vector Machines). SVMs are non-probabilistic: they focus on maximizing the margin between classes and can use kernel methods to handle nonlinear data.
- What does it mean for an SVM to ‘maximize the margin,’ and why might that be desirable?
Maximizing the margin means finding a decision boundary (in the input space, or in a feature space when kernels are used) that is as far as possible from the nearest training points of either class. This often leads to better generalization, since a larger margin reduces sensitivity to small perturbations in the data.
- Show an example problem: Suppose we have 2D data with labels t ∈ {−1, +1}. How is a linear decision boundary represented in SVMs, and what do we need to find?
A linear decision boundary is represented by wᵀ x + b = 0, where w and b are parameters. We must find w and b that maximize the margin (i.e., the perpendicular distance between the boundary and the closest training points).
- What is the ‘margin’ in a linear SVM, and how do we express maximizing it in a mathematical optimization?
For two classes with labels ±1, the margin is the distance from the decision boundary to the closest data points of either class. Maximizing the margin is equivalent to minimizing (1/2)||w||² subject to tₙ(wᵀxₙ + b) ≥ 1 for each training point n. These constraints ensure every point lies on the correct side of the margin.
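Written as a display (with N training points and labels tₙ ∈ {−1, +1}), the hard-margin problem above is:
```latex
\begin{aligned}
\min_{\mathbf{w},\,b}\quad & \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2} \\
\text{subject to}\quad & t_n\bigl(\mathbf{w}^{\top}\mathbf{x}_n + b\bigr) \ge 1, \qquad n = 1, \dots, N.
\end{aligned}
```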
- What are ‘support vectors,’ and why do they make SVM solutions sparse?
Support vectors are the training points that lie exactly on the margin boundaries (those satisfying tₙ(wᵀxₙ + b) = 1 in the hard-margin case) or, in the soft-margin case, inside the margin or on the wrong side of it. Only these points end up with non-zero Lagrange multipliers αₙ in the dual formulation, so the decision function depends only on them – this yields a sparse solution.
- Present a short challenge: Suppose your linear SVM boundary is determined by w = Σ αₙ tₙ xₙ. Explain how we can compute predictions for a new point x_new using only support vectors.
The classification is t_new = sign(Σ αₙ tₙ xₙᵀ x_new + b). Because αₙ = 0 for every point except the support vectors, we only sum over the points with αₙ ≠ 0, making the prediction depend on the support vectors rather than the entire dataset.
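A minimal NumPy sketch of this prediction rule (the names `svm_predict`, `alphas`, and `targets` are illustrative, not from any particular library):
```python
import numpy as np

def svm_predict(x_new, alphas, targets, X, b):
    """Predict sign(sum_n alpha_n * t_n * x_n^T x_new + b).

    Only points with alpha_n > 0 (the support vectors) contribute,
    so the rest are dropped before summing.
    """
    sv = alphas > 1e-8                          # mask of support vectors
    scores = alphas[sv] * targets[sv] * (X[sv] @ x_new)
    return np.sign(scores.sum() + b)
```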
- Why might a ‘hard margin’ SVM fail if the data are not perfectly separable, and how is this addressed in a ‘soft margin’ SVM?
A hard margin requires tₙ(wᵀxₙ + b) ≥ 1 for all points, which is impossible when the classes overlap or the data are noisy. A soft margin introduces slack variables ξₙ ≥ 0 and relaxes the constraints to tₙ(wᵀxₙ + b) ≥ 1 − ξₙ. The objective becomes minimizing (1/2)||w||² + C Σₙ ξₙ, trading off margin size against margin violations.
- Give an example where one data point is an outlier that severely distorts the margin under a hard margin SVM. How does the soft margin version fix that?
In a 2D dataset, imagine a single mislabeled point far on the other side from the bulk of its class. A hard margin SVM tries to separate that point perfectly, rotating the boundary drastically. A soft margin SVM instead absorbs the outlier with a moderate slack ξ (its penalty controlled by C), keeping a more sensible boundary for the majority of the data.
- In the soft margin SVM, what role does the parameter C play, and how might we choose it?
C penalizes the sum of slack variables. A larger C places more emphasis on classifying every point correctly (risking overfitting), while a smaller C focuses more on maximizing the margin (risking underfitting). We typically use cross-validation to find the best C for good generalization.
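As a quick illustration, assuming scikit-learn is available (its `SVC` exposes exactly this `C` penalty), two linear SVMs trained on the same overlapping data:
```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs: not perfectly separable.
X, t = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy {clf.score(X, t):.2f}")
# A small C tolerates more margin violations (wider margin, more support
# vectors); a large C tries harder to classify every training point correctly.
```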
- What are kernel functions, and why do they let us extend linear SVMs to nonlinear decision boundaries without explicitly doing the feature mapping?
A kernel k(xₙ, xₘ) represents the inner product ϕ(xₙ)ᵀ ϕ(xₘ) in some (possibly high-dimensional) feature space. By replacing xₙᵀ xₘ with k(xₙ, xₘ) in the SVM’s dual form, we can perform a ‘linear’ SVM in that transformed space without ever computing ϕ(x) explicitly. This allows complex boundaries to be learned with the same optimization.
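A small numerical check of this identity for the quadratic kernel k(x, z) = (1 + xᵀz)² on 2D inputs, whose explicit feature map is ϕ(x) = (1, √2·x₁, √2·x₂, x₁², x₂², √2·x₁x₂); the point is only that the kernel value matches the inner product without ever forming ϕ:
```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, z) = (1 + x.z)^2 with 2-D inputs."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
z = np.array([2.0,  0.5])

kernel_value   = (1 + x @ z) ** 2        # no mapping needed
explicit_value = phi(x) @ phi(z)         # same number via the mapping
print(np.isclose(kernel_value, explicit_value))   # True
```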
- Give an example problem: You have data on a circle vs. outside a circle in 2D. Which kernel might help separate it linearly in a higher dimension, and what’s the general idea?
A polynomial kernel k(xₙ, xₘ) = (1 + xₙᵀ xₘ)ᵖ or a Gaussian kernel can help. For a circle problem, a radial (Gaussian) kernel can map data into a feature space where the circle boundary becomes a hyperplane. We never have to do the explicit mapping – just compute kernel values.
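A hedged sketch using scikit-learn’s `make_circles` and an RBF-kernel SVM (dataset sizes and parameter values are purely illustrative):
```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Inner circle vs. outer ring: not linearly separable in 2D.
X, t = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, t_tr, t_te = train_test_split(X, t, random_state=0)

linear = SVC(kernel="linear").fit(X_tr, t_tr)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X_tr, t_tr)

print("linear kernel:", linear.score(X_te, t_te))   # near chance level
print("rbf kernel:   ", rbf.score(X_te, t_te))      # close to 1.0
```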
- Write down the dual optimization for the kernel SVM, highlighting which parts depend on the kernel k(xₙ, xₘ).
argmax_α Σₙ αₙ - (1/2) Σₙₘ αₙ αₘ tₙ tₘ k(xₙ, xₘ) subject to Σₙ αₙ tₙ = 0, 0 ≤ αₙ ≤ C. The kernel k(xₙ, xₘ) appears wherever we originally had xₙᵀ xₘ in the linear SVM.
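The same dual in display form (the soft-margin case, hence the box constraint on αₙ):
```latex
\begin{aligned}
\max_{\boldsymbol{\alpha}}\quad & \sum_{n} \alpha_n \;-\; \tfrac{1}{2} \sum_{n}\sum_{m} \alpha_n \alpha_m\, t_n t_m\, k(\mathbf{x}_n, \mathbf{x}_m) \\
\text{subject to}\quad & \sum_{n} \alpha_n t_n = 0, \qquad 0 \le \alpha_n \le C.
\end{aligned}
```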
- What does the decision function become in the kernelized SVM, and why might we not be able to visualize the boundary easily?
The decision is t_new = sign(Σₙ αₙ tₙ k(xₙ, x_new) + b). We can’t directly visualize w or any higher-dimensional mapping ϕ(x) because we never compute ϕ(x) explicitly. We only have the kernel evaluations, so we typically just plot decision contours numerically.
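A minimal NumPy sketch of this decision rule with a Gaussian kernel (function names are illustrative; a library such as scikit-learn does this internally):
```python
import numpy as np

def rbf_kernel(A, B, beta):
    """k(a, b) = exp(-beta * ||a - b||^2) for every pair of rows in A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-beta * sq_dists)

def kernel_decision(X_new, alphas, targets, X_sv, b, beta):
    """sign(sum_n alpha_n t_n k(x_n, x_new) + b) for each row of X_new,
    summing only over the stored support vectors X_sv."""
    K = rbf_kernel(X_sv, X_new, beta)           # shape (n_sv, n_new)
    return np.sign((alphas * targets) @ K + b)
```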
- Show a short example: If you use a Gaussian kernel k(xₙ, xₘ)=exp(-β||xₙ - xₘ||²), how does β affect the complexity of the decision boundary?
A small β => wide Gaussian => each point has influence over a large region => smoother, simpler boundary. A large β => narrow Gaussian => each point mostly influences nearby points => can create very wiggly, complex boundaries that may overfit. Hence β needs tuning via cross-validation.
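A hedged illustration with scikit-learn, whose RBF kernel is exp(−γ||x − x′||²), so `gamma` plays the role of β here:
```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, t = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, t_tr, t_te = train_test_split(X, t, random_state=0)

for gamma in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, t_tr)
    print(f"gamma={gamma:>6}: train {clf.score(X_tr, t_tr):.2f}, "
          f"test {clf.score(X_te, t_te):.2f}")
# A very large gamma typically pushes training accuracy toward 1.0 while
# test accuracy drops -- the wiggly, overfit boundary described above.
```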
- In practice, how do you tune parameters C and any kernel parameters (e.g., β for a Gaussian kernel) so the SVM generalizes well?
We typically do a grid search or randomized search over (C, β) using cross-validation. For each candidate pair, we train the SVM on the training folds, measure performance on validation folds, and choose the pair that yields the best average performance (e.g., highest accuracy or AUC).
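A minimal grid-search sketch, assuming scikit-learn (the grid values are placeholders; scikit-learn’s `gamma` again stands in for β):
```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, t = make_moons(n_samples=300, noise=0.3, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="roc_auc")
search.fit(X, t)

print("best params:", search.best_params_)
print("best CV AUC:", round(search.best_score_, 3))
```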
- What is the 0/1 loss in classification, and why can it be misleading for imbalanced problems?
The 0/1 loss is the proportion of misclassifications. For imbalanced data (e.g., 1% positives), a trivial classifier predicting the majority class can achieve high accuracy (e.g., 99%) but miss all minority cases, which may be unacceptable in domains like fraud detection or disease diagnosis.
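A tiny numeric illustration (the numbers are hypothetical; scikit-learn metrics are used only for convenience):
```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1000 cases, only 1% positive; a "classifier" that always predicts negative.
t_true = np.array([1] * 10 + [0] * 990)
t_pred = np.zeros_like(t_true)

print("accuracy:   ", accuracy_score(t_true, t_pred))   # 0.99, looks great
print("sensitivity:", recall_score(t_true, t_pred))     # 0.0, misses every positive
```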
- Define sensitivity (Se) and specificity (Sp). How do they help measure performance in, say, a rare disease classification?
Sensitivity (Se) = TP / (TP + FN), the proportion of actual positives caught as positive. Specificity (Sp) = TN / (TN + FP), the proportion of negatives correctly identified as negative. For a rare disease, it’s critical to see if the classifier misses diseased individuals (low sensitivity), even if specificity is high.
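A short sketch computing both from a binary confusion matrix (the labels and predictions are made up; for binary labels scikit-learn’s matrix unravels as tn, fp, fn, tp):
```python
from sklearn.metrics import confusion_matrix

t_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
t_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(t_true, t_pred).ravel()
sensitivity = tp / (tp + fn)     # 2 / 4 = 0.50
specificity = tn / (tn + fp)     # 5 / 6 ≈ 0.83
print(sensitivity, specificity)
```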
- Provide an example problem: You have a binary classifier that outputs a score for each instance. Explain how you would generate an ROC curve from these scores.
For each possible threshold τ, label instances as 1 if score ≥ τ else 0. Compute the true positive rate (Se) and false positive rate (1 - Sp) for that threshold. Plot the pair (1 - Sp, Se) on the ROC plane. Sweeping τ from largest to smallest traces the ROC curve, showing the trade-off between TPR and FPR.
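A minimal sketch of that sweep (the scores and labels are made up; in practice scikit-learn’s `roc_curve` performs the same computation):
```python
import numpy as np

def roc_points(scores, t_true):
    """For each threshold tau, return (FPR, TPR) = (1 - Sp, Se)."""
    points = []
    for tau in sorted(set(scores), reverse=True):
        t_pred = (scores >= tau).astype(int)
        tp = np.sum((t_pred == 1) & (t_true == 1))
        fp = np.sum((t_pred == 1) & (t_true == 0))
        points.append((fp / np.sum(t_true == 0),    # false positive rate
                       tp / np.sum(t_true == 1)))   # true positive rate
    return points

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
t_true = np.array([1,   1,   0,   1,   0,    0,   1,   0])
print(roc_points(scores, t_true))
```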
- What is the AUC (Area Under the ROC Curve), and why is it a helpful single-number summary?
AUC is the area under the ROC curve; 0.5 corresponds to random guessing and 1.0 to perfect separation. It provides a threshold-independent measure of a classifier’s ability to rank positive instances higher than negative ones, making it less sensitive to class imbalance and to any specific threshold choice than raw accuracy.
- What is a confusion matrix, and how might it guide improvements to a multi-class classifier?
A confusion matrix tabulates true labels against predicted labels, with one row and one column per class. Correct predictions sit on the diagonal and mistakes appear off-diagonal. By inspecting which classes get confused with each other, you can target data collection or feature engineering to improve those distinctions.
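A three-class example with scikit-learn (the class names and labels are made up; in scikit-learn’s layout rows are true classes and columns are predicted classes):
```python
from sklearn.metrics import confusion_matrix

labels = ["cat", "dog", "rabbit"]
t_true = ["cat", "cat", "dog", "dog", "dog", "rabbit", "rabbit", "cat"]
t_pred = ["cat", "dog", "dog", "dog", "cat", "rabbit", "dog",    "cat"]

print(confusion_matrix(t_true, t_pred, labels=labels))
# Off-diagonal counts (e.g., a rabbit predicted as a dog) show where to
# focus data collection or feature engineering.
```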
- Summarize the main advantages of SVMs (especially kernelized) compared to simpler linear models or KNN.
Advantages of SVMs include a convex quadratic program with a single global optimum, the ability to handle high-dimensional data, a robust margin-based approach that often generalizes well, and the flexibility of kernels for creating complex decision boundaries without explicit feature transformations. They can outperform basic linear models or KNN in many settings.
- Give a short challenge: Suppose your SVM with a Gaussian kernel has too many support vectors and seems to memorize noise. Which hyperparameters might you adjust?
Decrease C (to allow more margin violations and reduce overfitting) or reduce β in the Gaussian kernel (to widen the radial basis and produce a smoother boundary). Both changes reduce the complexity of the decision boundary, potentially lowering overfitting.
- How can we interpret the computational complexity of SVM training, and why might this be problematic for very large datasets?
Naive SVM solvers can scale roughly as O(N³), since the quadratic program involves the N × N kernel matrix, which grows with the number of training points. For extremely large N this becomes slow, and advanced or approximate methods (e.g., SMO, stochastic gradient descent) or sub-sampling may be needed.
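A hedged sketch of the stochastic-gradient route mentioned above, using scikit-learn’s `SGDClassifier` with the hinge loss (which optimizes a linear SVM objective); the synthetic dataset is only meant to suggest scale:
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# A dataset large enough that an exact kernel QP would start to get expensive.
X, t = make_classification(n_samples=100_000, n_features=50, random_state=0)

# Hinge loss + L2 penalty = a linear SVM trained by stochastic gradient descent.
clf = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0)
clf.fit(X, t)
print(clf.score(X, t))
```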
- Briefly describe how one might kernelize an algorithm like KNN, and why it’s not as straightforward as with SVMs or other inner-product-based methods.
KNN relies on distances (||x_new − xₙ||²) to find nearest neighbors rather than inner products, so it isn’t a simple matter of replacing xᵀx with k(x, x′). Distances in the feature space can still be expressed through the kernel, since ||ϕ(x) − ϕ(x′)||² = k(x, x) − 2k(x, x′) + k(x′, x′), but this substitution must be made wherever a distance appears; the notion of ‘nearest neighbors’ depends on a metric, not just an inner product, so additional engineering is required.
- Conclude: What is the key idea behind Kernel SVMs, and what are the recommended steps to ensure good performance in practice?
Kernel SVMs transform data into a (potentially high-dimensional) feature space via a kernel function to learn a linear boundary there, yielding powerful nonlinear classifiers. To ensure good performance, we (1) choose an appropriate kernel, (2) tune hyperparameters (C, kernel params) via cross-validation, and (3) evaluate with robust metrics (AUC, confusion matrices) to handle imbalanced or multi-class data.