Section 4 Support Vector Classifiers Flashcards

1
Q

What is a linear classifier? Give an example.

A

A linear classifier classifies observations on the basis of a linear combination of the input features and parameters, i.e. a hyperplane. Logistic regression is an example of a linear classifier: the hyperplane is mapped to the probability that an instance belongs to a class, and it defines a linear decision boundary.
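As a minimal sketch (the notation β0, β and x is an assumption, not taken from the cards): a linear classifier is based on a function of the form
f(x) = β0 + β1x1 + ... + βpxp = β0 + β^T x,
and classifies according to the value of f(x). Logistic regression maps this linear combination to a probability,
Pr(y = 1|x) = 1 / (1 + exp(-(β0 + β^T x))),
so the decision boundary f(x) = 0 (equivalently probability 0.5) is a hyperplane.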

2
Q

What is the reason for SVM theory

A

The reason for SVM theory is that logistic regression always identifies a linear boundary, even if the classes overlap or the data is clearly not linearly separable. Support vector machines allow for non-linear decision boundaries.

3
Q

If data is perfectly separable, how can we construct a classifier using a separating hyperplane?

A

We can construct a classifier using the hyperplane which maximally separates the two classes.

4
Q

How does the separating hyperplane classify?

A

The classifier will classify a (new) observation according to which side of the hyperplane it lies on.
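A minimal sketch of the rule, writing the hyperplane as f(x) = β0 + β^T x = 0 (notation assumed): classify a new observation x to class +1 if f(x) > 0 and to class -1 if f(x) < 0, i.e. predicted class = sign(f(x)).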

5
Q

What is the margin for data which is separable

A

For any separating hyperplane, we can look at the perpendicular distance from it to the closest point in each class. This distance is called the margin.

6
Q

Define the maximum separating hyperplane for data which is separable

A

The maximum separating hyperplane is the separating hyperplane with the maximum margin: it ensures the largest separation between the classes.
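A sketch of the underlying optimisation, in a standard formulation (notation assumed, with classes coded yi ∈ {-1, +1}):
maximise M over β0, β1, ..., βp
subject to Σj βj^2 = 1 and yi(β0 + β^T xi) ≥ M for all i = 1, ..., n,
so that every training point lies at least a distance M from the hyperplane, and M is the margin being maximised.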

7
Q

What are the support vectors for data which is separable

A

The lines that define the margin and pass through data points from each class are called the positive and negative support vectors. The maximum separating hyperplane lies halfway between the two support vectors.

8
Q

How is the optimization problem to find the maximum separating hyperplane solved?

A

The optimisation problem is solved using constrained optimisation with Lagrange multipliers; solving the constrained optimisation problem is an involved process.

9
Q

What is meant by a hard margin

A

A hard margin arises when the support vector classifier, i.e. the hyperplane found by the optimisation, is obtained under a constraint which ensures there are no errors: no points on the wrong side of the support vectors. The boundary classifies the training data perfectly.
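In a standard (assumed) formulation, the hard-margin problem can be written as
minimise (1/2)||β||^2 over β0, β
subject to yi(β0 + β^T xi) ≥ 1 for all i,
i.e. no training observation is allowed inside the margin or on the wrong side of it.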

10
Q

Why is the dual formulation key in the optimisation problem to solve for the maximum separating hyperplane?

A

The dual formulation is key as it expresses the optimisation criterion in terms of inner products of the observations xi.
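A sketch of the dual, in a standard form (the multipliers αi are assumed notation):
maximise Σi αi - (1/2) Σi Σh αi αh yi yh xi^T xh
subject to αi ≥ 0 and Σi αi yi = 0.
The observations enter only through the inner products xi^T xh, which is what later allows them to be replaced by a kernel.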

11
Q

Explain what is meant by a soft margin

A

A soft margin arises when the support vector classifier, i.e. the hyperplane found by the optimisation, is obtained under a constraint which allows the margin to be softer: we maximise the margin but allow some observations on the wrong side of the margin or of the separating hyperplane. This is equivalent to constraining the problem using a cost function.
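A sketch of the soft-margin problem, in a standard formulation with slack variables ξi and cost C (notation assumed):
minimise (1/2)||β||^2 + C Σi ξi over β0, β, ξ
subject to yi(β0 + β^T xi) ≥ 1 - ξi and ξi ≥ 0 for all i.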

12
Q

What does the Cost C stand for

A

The cost is related to the number of observations violating the margin. The larger the cost, the less tolerant we are of violations of the margin and the more we strive to classify all the (training) data points correctly.
For small C, the classifier will tolerate a certain degree of misclassified observations.

13
Q

Define the slack variable

A

A slack variable ξi indicates where an observation is located relative to the hyperplane and relative to the margin.

14
Q

What values can the slack variables take?

A

ξi = 0 means observation is on correct side of margin
ξi > 0 means observation is on wrong side of margin
ξi > 1 means observation is on wrong side of hyperplane

15
Q

How does the cost C control the complexity of the prediction function?

A

The cost C expression is a penalty. C is fixed in advance, controlling the penalty paid by the classifier for misclassifying a training point and thus the complexity of the prediction function.
A high cost C will force the classifier to create a complex enough prediction function to misclassify as few training points as possible.

16
Q

Why are kernels used in this branch of machine learning

A

To address non-linearity, the idea is to enlarge the input feature space using functions of the predictors, so as to include non-linear functional terms; we need more flexibility.
Kernels allow us to do this efficiently.

17
Q

What is a kernel

A

Consider two generic input vectors xi and xh of the input feature space.
A kernel function is any function defined on pairs (xi, xh) of that space which behaves like an inner product.
The kernel function quantifies the similarity between two observations: a kernel function K(·,·) is a generalisation of the inner product. The associated mapping usually takes the data into a higher-dimensional space.
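In symbols (with φ denoting the assumed feature mapping):
K(xi, xh) = φ(xi)^T φ(xh),
and the ordinary inner product xi^T xh is recovered as the special case where φ is the identity (the linear kernel).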

18
Q

In what ways are kernels more efficient?

A

Computing the feature mapping explicitly can be inefficient, and so can working with the mapped representation.
So instead of using the original data x directly, we enlarge the feature space (e.g. with polynomial terms) using kernels, which is much easier.
With kernels, the mapping isn't computed explicitly and computations with the mapped features remain efficient.

19
Q

Explain concisely the kernel trick

A

Implicit computation of the dot product and of the feature mapping to a higher-dimensional space.
The trick is given by the formula K(xi, xh) = φ(xi)^T φ(xh): the dot product of the feature mappings, which is computationally expensive, is equal to the kernel function applied to the original input space, which is cheaper.
An SVM uses a non-linear mapping to define a non-linear classifier, and this is implemented via the kernel function in a very efficient manner.
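A worked sketch for a degree-2 polynomial kernel in two dimensions (an illustration, not from the cards): for x = (x1, x2) and z = (z1, z2),
K(x, z) = (1 + x^T z)^2 = (1 + x1z1 + x2z2)^2 = φ(x)^T φ(z),
with φ(x) = (1, √2 x1, √2 x2, x1^2, x2^2, √2 x1x2).
The kernel computes the inner product in the 6-dimensional enlarged space while only ever evaluating the original 2-dimensional inputs.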

20
Q

Summarise the return of the kernel function

A

The kernel function returns the inner product between two points in the enlarged feature space (thus defining a notion of similarity) with little computational cost even in very high-dimensional spaces.

21
Q

Name five kernel functions

A

Linear, polynomial, Gaussian Radial Basis Function kernel, Laplace radial basis function kernel, sigmoid kernel.
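Typical forms (standard definitions; exact parametrisations vary between textbooks and software, so treat these as assumed):
Linear: K(xi, xh) = xi^T xh
Polynomial (degree d): K(xi, xh) = (1 + xi^T xh)^d
Gaussian RBF: K(xi, xh) = exp(-γ ||xi - xh||^2), γ > 0 (γ is often expressed via a width parameter σ)
Laplace RBF: K(xi, xh) = exp(-γ ||xi - xh||)
Sigmoid: K(xi, xh) = tanh(κ xi^T xh + c)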

22
Q

What is the GRBF kernel used for

A

It assumes each data point has a Gaussian distribution centred around another data point (e.g. point i is normal around the mean). If sigma is set to be small, the function is less flexible.

23
Q

What is the polynomial kernel used for

A

The polynomial kernel can, for example, account for quadratic or cubic decision boundaries.

24
Q

Kernels and support vector classifiers can build what

A

support vector machines for classification.

25
Q

What is the goal of a support vector classifier, linear or otherwise?

A

A support vector classifier seeks to maximise the distance between the classes, i.e. the margin.
The problem of maximising the margin can be expressed in terms of inner products, so we can express these using kernels.
By including kernels we define support vector machines.
The solution to the support vector classifier (both hard and soft) depends only on the inner products of the observations, xi^T xh.

26
Q

Define a support vector machine

A

The support vector machine (SVM) is an extension of the support vector classifier resulting from enlarging the feature space using kernels.
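A sketch of the resulting classifier (standard form; the coefficients αi are assumed notation): the SVM prediction function can be written as
f(x) = β0 + Σ(i ∈ S) αi K(x, xi),
where S is the set of support vectors, and a new observation is classified according to the sign of f(x).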

27
Q

Define a standard support vector classifier

A

An SVM with a linear kernel function.

28
Q

How is the optimisation problem underlying a standard support vector classifier solved?

A

It can be shown that the optimisation problem underlying a support vector classifier f(xi) corresponds to the minimisation of a penalised loss function.
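A sketch of that penalised loss in its "hinge loss plus ridge penalty" form (λ and the coding yi ∈ {-1, +1} are assumed notation):
minimise Σi max(0, 1 - yi f(xi)) + λ ||β||^2 over β0, β, with f(xi) = β0 + β^T xi.
A large λ corresponds to a small cost C (a wider margin, more violations tolerated), and vice versa.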

29
Q

What is the penalised loss function, or hinge loss function?

A

The loss function compares the observed classes with the values predicted by the model. Correctly classified points will have a small loss, while incorrectly classified instances will have a large loss.

30
Q

Why is it important to distinguish SVM and logistic regression - differing results?

A

For an SVM with a non-linear kernel function, SVM and logistic regression tend to give different results. The SVM will in general give better predictive performance than the logistic model.

31
Q

What is a downfall of SVM

A

SVM is very hard to interpret and its only use is prediction/classification: it can't be used for anything else.
The predictive performance of SVM can be very sensitive to the choice of the kernel and the cost C.
Differently from logistic regression, SVM does not return estimated class probabilities.
What if we wish to compute AU-ROC and AU-PR, or consider different probability thresholds for classification? (See Platt's method.)

32
Q

Explain overfitting and underfitting in a SVM classifier

A

A too flexible SVM classifier will work well on the training data, but it may have reduced predictive performance on new observations (overfitting).
A not flexible enough SVM classifier will extract reduced information from training data, deteriorating the predictive performance on new observations (underfitting).
Flexibility depends on the choice of the kernel and the Cost C

33
Q

What can be employed to select the best SVM

A

Cross-validation can be employed to select among a collection of different SVMs characterised by different kernel functions and different hyperparameters.
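As an illustrative sketch only (scikit-learn in Python; the synthetic data and the parameter grid are assumptions, not from the cards), such a selection might look like:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# synthetic data purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# candidate SVMs: different kernel functions and hyperparameters (illustrative grid)
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
]

# 5-fold cross-validation selects the best kernel / hyperparameter combination
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)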

34
Q

What approach is used for multiclass classification?

A

One-vs-one approach (OVO)

35
Q

What is Platt's method?

A

Platt's method allows for the estimation of posterior class probabilities Pr(yi = 1|xi) from the output of the SVM.
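A sketch of the idea (standard form of Platt scaling; A and B are assumed names for the fitted constants): a sigmoid is fitted to the SVM output f(xi),
Pr(yi = 1|xi) = 1 / (1 + exp(A f(xi) + B)),
with A and B estimated by maximum likelihood, typically on held-out or cross-validated data.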

36
Q

Describe the advantages of the kernel trick and how it is employed to implement support vector machines

A

The kernel trick is the implicit computation of the feature mapping and of the dot product in the enlarged feature space.
A kernel function is a generalisation of the inner product, where phi is a mapping function of the data; the mapping takes the data into a higher-dimensional space, which kernels handle efficiently.
Advantages:
The mapping doesn't have to be computed explicitly
Computations with the mapped features remain efficient
Less computational cost

37
Q

Describe the advantages of the kernel trick

A

Advantages:
The mapping doesn't have to be computed explicitly
Computations with the mapped features remain efficient
Less computational cost