Chapter 5: Support Vector Machines Flashcards

1
Q

What is a support vector?

A

After training an SVM, a support vector is any instance located on the “street” (the margin region between the classes; see the next answer), including its border. The decision boundary is entirely determined by the support vectors. Any instance that is not a support vector (i.e., is off the street) has no influence whatsoever; you could remove such instances, add more, or move them around, and as long as they stay off the street they won’t affect the decision boundary. Computing the predictions with a kernelized SVM only involves the support vectors, not the whole training set.
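
For instance, a trained SVC exposes its support vectors directly. A minimal sketch on an arbitrary two-class slice of the iris dataset (the kernel and C value are illustrative choices, not part of the answer):

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X, y = X[y < 2, 2:], y[y < 2]   # keep two classes and the petal features only

svm_clf = SVC(kernel="linear", C=1).fit(X, y)
print(svm_clf.support_vectors_)   # the instances on or inside the street
print(len(svm_clf.support_vectors_), "support vectors out of", len(X), "instances")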

2
Q

What is the fundamental idea behind support vector machines?

A

The fundamental idea behind Support Vector Machines is to fit the widest possible “street” between the classes. In other words, the goal is to have the largest possible margin between the decision boundary that separates the two classes and the training instances. When performing soft margin classification, the SVM searches for a compromise between perfectly separating the two classes and having the widest possible street (i.e., a few instances may end up on the street). Another key idea is to use kernels when training on nonlinear datasets. SVMs can also be tweaked to perform linear and nonlinear regression, as well as novelty detection.
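
For instance, in scikit-learn the soft-margin compromise is controlled by the C hyperparameter. A minimal sketch (the dataset slice and the C values are arbitrary illustrations):

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X, y = X[:, 2:], (y == 2).astype(int)   # Iris virginica vs. the rest

# A low C widens the street but tolerates more margin violations;
# a high C narrows the street in exchange for fewer violations.
wide_street_clf = make_pipeline(StandardScaler(), LinearSVC(C=1, dual=True, random_state=42)).fit(X, y)
narrow_street_clf = make_pipeline(StandardScaler(), LinearSVC(C=100, dual=True, random_state=42)).fit(X, y)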

3
Q

Why is it important to scale the inputs when using SVMs?

A

SVMs try to fit the largest possible “street” between the classes (see the answer on the fundamental idea behind SVMs), so if the training set is not scaled, the SVM will tend to neglect features with small values (see Figure 5-2 in the book).
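
In practice, a simple way to guarantee this is to put the scaler and the SVM in a single pipeline. A minimal sketch (StandardScaler and the RBF kernel are just one reasonable choice):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The scaler is fit on the training data and re-applied automatically at prediction time,
# so no feature dominates the margin just because of its units.
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
# svm_clf.fit(X_train, y_train)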

4
Q

Can an SVM classifier output a confidence score when it classifies an instance? What about a probability?

A

You can use the decision_function() method to get confidence scores. These scores represent the distance between the instance and the decision boundary. However, they cannot be directly converted into an estimation of the class probability. If you set probability=True when creating an SVC, then at the end of training it will use 5-fold cross-validation to generate out-of-sample scores for the training samples, and it will train a LogisticRegression model to map these scores to estimated probabilities. The predict_proba() and predict_log_proba() methods will then be available.
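
A small sketch of both options, using an arbitrary two-class slice of the iris dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X, y = X[y < 2, 2:], y[y < 2]

svm_clf = SVC(kernel="linear", probability=True, random_state=42).fit(X, y)
print(svm_clf.decision_function(X[:3]))   # signed distances to the decision boundary
print(svm_clf.predict_proba(X[:3]))       # probabilities from the internal calibration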

5
Q

How can you choose between LinearSVC, SVC, and SGDClassifier?

A

All three classes can be used for large-margin linear classification. The SVC class also supports the kernel trick, which makes it capable of handling nonlinear tasks. However, this comes at a cost: the SVC class does not scale well to datasets with many instances. It does scale well to a large number of features, though. The LinearSVC class implements an optimized algorithm for linear SVMs, while SGDClassifier uses Stochastic Gradient Descent. Depending on the dataset, LinearSVC may be a bit faster than SGDClassifier, but not always, and SGDClassifier is more flexible, plus it supports incremental learning.
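
A rough sketch of three roughly comparable linear classifiers (the C and alpha values are illustrative), along with the incremental API that only SGDClassifier offers:

from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC, SVC

lin_clf = LinearSVC(loss="hinge", C=1, dual=True, random_state=42)   # optimized linear SVM solver
svc_clf = SVC(kernel="linear", C=1)                                  # also supports the kernel trick
sgd_clf = SGDClassifier(loss="hinge", alpha=0.01, random_state=42)   # Stochastic Gradient Descent

# Only SGDClassifier can learn incrementally, one mini-batch at a time:
# for X_batch, y_batch in batches:
#     sgd_clf.partial_fit(X_batch, y_batch, classes=[0, 1])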

6
Q

Say you’ve trained an SVM classifier with an RBF kernel, but it seems to underfit the training set. Should you increase or decrease γ (gamma)? What about C?

A

If an SVM classifier trained with an RBF kernel underfits the training set, there might be too much regularization. To decrease it, you need to increase gamma or C (or both).
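
For example (the exact values are arbitrary; the point is only the direction of the change):

from sklearn.svm import SVC

underfit_clf = SVC(kernel="rbf", gamma=0.1, C=0.1)   # heavily regularized: likely to underfit
better_clf = SVC(kernel="rbf", gamma=5, C=10)        # larger gamma and C: less regularization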

7
Q

What does it mean for a model to be ϵ-insensitive?

A

An SVM regression model tries to fit as many instances as possible within a small margin around its predictions. If you add instances within this margin, the model will not be affected at all: it is said to be ϵ-insensitive.
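
In scikit-learn, the width of that margin is the epsilon hyperparameter of SVR and LinearSVR. A minimal sketch (the kernel and epsilon value are arbitrary):

from sklearn.svm import SVR

# Instances falling inside the epsilon-wide margin around the predictions
# contribute nothing to the loss, so adding more of them does not change the model.
svm_reg = SVR(kernel="linear", epsilon=0.5)
# svm_reg.fit(X, y)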

8
Q

What is the point of using the kernel trick?

A

The kernel trick is a mathematical technique that makes it possible to train a nonlinear SVM model. The resulting model is equivalent to mapping the inputs to another space using a nonlinear transformation, then training a linear SVM on the resulting high-dimensional inputs. The kernel trick gives the same result without having to transform the inputs at all.
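
For instance, the two pipelines below are conceptually the same kind of polynomial SVM classifier, but only the first one materializes the transformed features (degree, coef0, and C are arbitrary illustrative values):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC

# Explicit mapping: actually compute the polynomial features, then fit a linear SVM on them.
explicit_clf = make_pipeline(PolynomialFeatures(degree=3), StandardScaler(),
                             SVC(kernel="linear", C=5))

# Kernel trick: a comparable model, but the transformed features are never computed.
kernel_clf = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, coef0=1, C=5))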

9
Q

Train a LinearSVC on a linearly separable dataset. Then train an SVC and an SGDClassifier on the same dataset. See if you can get them to produce roughly the same model.

A

Load the data into X and y (any linearly separable dataset will do)
Use loss="hinge" for LinearSVC and kernel="linear" for SVC; SGDClassifier uses the hinge loss by default
SGDClassifier regularizes with alpha instead of C (a smaller alpha means weaker regularization)
Define C = 5 and alpha = 0.05
Scale the features (keep the fitted scaler so the boundaries can be mapped back later), then fit all three models:

from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC

C, alpha = 5, 0.05
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

lin_clf = LinearSVC(loss="hinge", C=C, dual=True, random_state=42).fit(X_scaled, y)
svc_clf = SVC(kernel="linear", C=C).fit(X_scaled, y)
sgd_clf = SGDClassifier(alpha=alpha, random_state=42).fit(X_scaled, y)

Plot the decision boundaries in the original (unscaled) feature space:

def compute_decision_boundary(model):
    # The boundary w0*x0 + w1*x1 + b = 0 gives x1 = -(w0/w1)*x0 - b/w1.
    w = -model.coef_[0, 0] / model.coef_[0, 1]
    b = -model.intercept_[0] / model.coef_[0, 1]
    return scaler.inverse_transform([[-10, -10 * w + b], [10, 10 * w + b]])

lin_line = compute_decision_boundary(lin_clf)
svc_line = compute_decision_boundary(svc_clf)
sgd_line = compute_decision_boundary(sgd_clf)

10
Q

Train an SVM classifier on the wine dataset, which you can load using sklearn.datasets.load_wine(). This dataset contains the chemical analyses of 178 wine samples produced by 3 different cultivators: the goal is to train a classification model capable of predicting the cultivator based on the wine’s chemical analysis. Since SVM classifiers are binary classifiers, you will need to use one-versus-all to classify all three classes. What accuracy can you reach?

A

Load the data and split it into training and test sets
Fit LinearSVC(dual=True, random_state=42) on the raw training data
It does not converge, which hints at a problem, but still measure the accuracy as a baseline
Estimate the accuracy with cross_val_score(model, X_train, y_train).mean()
The problem is the unscaled features: scale them with StandardScaler() and fit LinearSVC again
Switch to SVC and fine-tune svc__gamma and svc__C with RandomizedSearchCV (see the sketch below)
Check rnd_search_cv.best_score_ on the training set
Check rnd_search_cv.score(X_test, y_test) on the test set

The key takeaway: always scale the features before fitting an SVM.
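
A minimal sketch of those steps, assuming the pipeline step name svc (hence the svc__ prefix) and purely illustrative search distributions for gamma and C:

from scipy.stats import loguniform, uniform
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

svm_clf = make_pipeline(StandardScaler(), SVC(random_state=42))
param_distrib = {"svc__gamma": loguniform(0.001, 0.1),   # illustrative ranges, not prescribed
                 "svc__C": uniform(1, 10)}
rnd_search_cv = RandomizedSearchCV(svm_clf, param_distrib, n_iter=100, cv=5, random_state=42)
rnd_search_cv.fit(X_train, y_train)

print(rnd_search_cv.best_score_)             # cross-validated training accuracy
print(rnd_search_cv.score(X_test, y_test))   # test accuracy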

11
Q

Train and fine-tune an SVM regressor on the California housing dataset. You can use the original dataset rather than the tweaked version we used in Chapter 2, which you can load using sklearn.datasets.fetch_california_housing(). The targets represent hundreds of thousands of dollars. Since there are over 20,000 instances, SVMs can be slow, so for hyperparameter tuning you should use far fewer instances (e.g., 2,000) to test many more hyperparameter combinations. What is your best model’s RMSE?

A

Load the data and split it:

import numpy as np
from scipy.stats import loguniform, uniform
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR, LinearSVR

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Start with a linear SVM regressor (increase max_iter, e.g. LinearSVR(max_iter=5000, ….), if it does not converge):

lin_svr = make_pipeline(StandardScaler(), LinearSVR(dual=True, random_state=42))
lin_svr.fit(X_train, y_train)
y_pred = lin_svr.predict(X_train)
mse = mean_squared_error(y_train, y_pred)
rmse = np.sqrt(mse)   # 0.979565447829459

The RMSE gives a rough idea of the kind of error you should expect (with a higher weight for large errors): with this model we can expect errors close to $98,000!

Use RandomizedSearchCV to search for good gamma and C values for a kernelized SVR (the search ranges below are illustrative):

svm_reg = make_pipeline(StandardScaler(), SVR())
param_distrib = {"svr__gamma": loguniform(0.001, 0.1),
                 "svr__C": uniform(1, 10)}
rnd_search_cv = RandomizedSearchCV(svm_reg, param_distrib, n_iter=100, cv=3, random_state=42)
rnd_search_cv.fit(X_train, y_train)
rnd_search_cv.best_estimator_
-cross_val_score(rnd_search_cv.best_estimator_, X_train, y_train,
                 scoring="neg_root_mean_squared_error")   # flip the sign of the negative RMSE scores

y_pred = rnd_search_cv.best_estimator_.predict(X_test)
rmse = mean_squared_error(y_test, y_pred, squared=False)   # 0.5854732265172222
