Section 4 Support Vector Classifiers Flashcards
What is a linear classifier? Give an example.
A linear classifier classifies observations on the basis of a linear combination of the input features and parameters, i.e. a hyperplane. Logistic regression is an example of a linear classifier: the hyperplane is mapped to the probability that an instance belongs to a class. Because it defines a linear boundary, it is a linear classifier.
What is the reason for SVM theory
The motivation for SVM theory is that logistic regression always identifies a linear boundary, even if the classes overlap or it is clear the data are not linear. Support vector machines allow for non-linear decision boundaries.
If the data are perfectly separable, how can we construct a classifier using a separating hyperplane?
We can construct a classifier using the hyperplane which maximally separates the two classes.
How does the separating hyperplane classify?
The classifier will classify a (new) observation according to which side of the plane it lies on.
What is the margin for data which is separable
For any separating line, we can look at the perpendicular distance from it to the closest point of each class. This distance is called the margin.
Define the maximum separating hyperplane for data which is separable
The maximum separating hyperplane is the separating hyperplane with the maximum margin between the classes: it ensures the largest separation between the classes.
What are the support vectors for data which is separable
The lines that define the margin and pass through data points from each class are called the positive and negative support vectors. The maximum separating hyperplane lies halfway between the two support vectors.
How is the optimization problem to find the maximum separating hyperplane solved?
The optimization problem is solved using constrained optimization with Lagrange multipliers; solving the constrained optimisation problem is a complex process.
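A hedged sketch of the underlying constrained problem, in standard textbook notation (the symbols β and M are assumptions, not quoted from these cards): for observations xi with class labels yi ∈ {−1, +1},

\[
\max_{\beta_0,\beta_1,\dots,\beta_p,\,M} M
\quad \text{subject to} \quad
\sum_{j=1}^{p} \beta_j^2 = 1, \qquad
y_i\,(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \ge M \ \ \text{for all } i = 1,\dots,n.
\]

The Lagrange multipliers are introduced to handle the n inequality constraints.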
What is meant by a hard margin
When the support vector classifier, i.e. the hyperplane found by optimisation, is found under a constraint which ensures there are no errors: no points on the wrong side of the support vectors. This gives perfect classification of the training data by the boundary.
Why is the dual formulation key in the optimisation problem to solve for the maximum separating hyperplane?
The dual formulation is key because it expresses the optimisation criterion in terms of inner products of the observations xi.
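A hedged sketch of that dual, in standard notation with Lagrange multipliers αi (symbols assumed, not quoted from these cards); the data enter only through the inner products xi^T xh:

\[
\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{h=1}^{n} \alpha_i \alpha_h\, y_i y_h\, x_i^{T} x_h
\quad \text{subject to} \quad
\alpha_i \ge 0, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.
\]

Because only xi^T xh appears, the inner product can later be replaced by a kernel.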
Explain what is meant by a soft margin
When the support vector classifier, i.e. the hyperplane found by optimisation, is found with a constraint which allows the margin to be softer: we maximise the margin but allow some observations on the wrong side of the separating hyperplane, which is equivalent to constraining the problem using a cost function.
What does the Cost C stand for
The cost is related to the number of observations violating the margin. The larger the cost, the less tolerant we are of violations of the margin, and the more we strive to classify all the (training) data points correctly.
For small C, the classifier will be tolerant of a certain degree of misclassified observations.
Define the slack variable
A slack variable ξi indicates where an observation is located relative to the hyperplane and relative to the margin.
What values can the slack variables take?
ξi = 0 means observation is on correct side of margin
ξi > 0 means observation is on wrong side of margin
ξi > 1 means observation is on wrong side of hyperplane
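Putting C and the slack variables together, a hedged sketch of one common soft-margin formulation (the penalty parameterisation; the course's exact convention may differ):

\[
\min_{\beta_0,\,\beta,\,\xi} \ \frac{1}{2}\,\lVert \beta \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(\beta_0 + \beta^{T} x_i) \ge 1 - \xi_i, \qquad \xi_i \ge 0.
\]

In this parameterisation a larger C penalises margin violations more heavily, matching the interpretation of the cost above.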
How does the cost C control the complexity of the prediction function
The cost C expression is a penalty. C is fixed in advance, controlling the penalty paid by the classifier for misclassifying a training point and thus the complexity of the prediction function.
A high cost C will force the classifier to create a prediction function complex enough to misclassify as few training points as possible.
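A minimal scikit-learn sketch of this trade-off; the dataset and the candidate C values are illustrative assumptions, not taken from the course:

# Effect of the cost C on a linear support vector classifier (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    # Small C: more tolerant of margin violations; large C: pushes for few training errors.
    print(C, clf.score(X_train, y_train), clf.score(X_test, y_test))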
Why are kernels used in this branch of machine learning
To address non-linearity, the idea is to enlarge the input feature space using functions of the predictors so as to include non-linear terms. We need more flexibility.
Kernels allow us to do this efficiently.
What is a kernel
Consider two generic input vectors xi and xh of the input feature space.
A kernel function is any function on the space defined by (xi, xh) that behaves like an inner product.
The kernel function quantifies the similarity between two observations. A kernel function K(·) is a generalisation of the inner product. The mapping usually takes the data into a higher dimensional space.
In what ways are kernels more efficient
Computing the feature mapping can be inefficient. Also, using the mapped representation can be inefficient.
So instead of using the original data x, we are going to enlarge the feature space (for example with polynomial terms) using kernels, which is much easier.
Kernels mean the mapping isn't computed explicitly, and computations with the mapped features remain efficient.
Explain concisely the kernel trick
Implicit computation of the dot product and feature mapping to a higher dimensional space.
The trick is given by the formula K(xi, xh) = φ(xi)^T φ(xh).
Hence the dot product of the feature mappings, which is computationally expensive, is equal to the kernel function applied in the original input space, which is cheaper.
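A small numerical sketch (the vectors and the degree-2 polynomial kernel are illustrative assumptions) showing that the kernel value computed in the original space equals the inner product of explicitly mapped features:

# Degree-2 polynomial kernel K(x, z) = (x . z)^2 versus the explicit mapping
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2); both give the same number.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

kernel_value = (x @ z) ** 2        # computed in the original 2-d space
mapped_value = phi(x) @ phi(z)     # computed in the enlarged 3-d space
print(kernel_value, mapped_value)  # both print 16.0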
An SVM uses a non-linear mapping to define a non-linear classifier, and this is implemented via the kernel function in a very efficient manner: give the kernel function.
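A hedged completion in standard textbook form (not quoted verbatim from the course): the kernel generalises the inner product, K(xi, xh) = φ(xi)^T φ(xh), and the resulting SVM classifier can be written as

\[
f(x) = \beta_0 + \sum_{i \in S} \alpha_i\, K(x, x_i),
\]

where S indexes the support vectors and the αi are coefficients estimated from the training data.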
Summarise the return of the kernel function
The kernel function returns the inner product between two points in the enlarged feature space (thus defining a notion of similarity) with little computational cost even in very high-dimensional spaces.
Name five kernel functions
Linear, polynomial, Gaussian Radial Basis Function kernel, Laplace radial basis function kernel, sigmoid kernel.
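For reference, commonly used forms of these kernels (an assumed parameterisation; hyperparameter conventions vary across texts and software):

\[
\begin{aligned}
\text{Linear: } & K(x_i, x_h) = x_i^{T} x_h \\
\text{Polynomial: } & K(x_i, x_h) = (1 + x_i^{T} x_h)^{d} \\
\text{Gaussian RBF: } & K(x_i, x_h) = \exp\!\left(-\gamma \lVert x_i - x_h \rVert^{2}\right) \\
\text{Laplace RBF: } & K(x_i, x_h) = \exp\!\left(-\gamma \lVert x_i - x_h \rVert\right) \\
\text{Sigmoid: } & K(x_i, x_h) = \tanh\!\left(\kappa\, x_i^{T} x_h + c\right)
\end{aligned}
\]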
What is the GRBF kernel used for
It assumes each data point has a Gaussian distribution centred around another data point (e.g. point i is normal around the mean). If sigma is set to be small, the function is less flexible.
What is the polynomial kernel used for
The polynomial kernel can, for example, account for quadratic or cubic decision boundaries.
Kernels and support vector classifiers can build what
support vector machines for classification.
What is the goal of a support vector classifier, linear or otherwise?
A support vector classifier seeks to maximise the distance between the classes, i.e. the margin.
The problem of maximising the margin can be expressed in terms of inner products, so we can express these using kernels.
By including kernels we define support vector machines.
The solution to the support vector classifier (both hard and soft) depends only on the inner products of the observations, xi^T xh.
Define a support vector machine
The support vector machine (SVM) is an extension of the support vector classifier resulting from enlarging the feature space using kernels.
Define a standard support vector classifier
An SVM with a linear kernel function
How is the optimisation problem underlying a standard support vector classifier solved
It can be shown that the optimization problem underlying a support vector classifier f(xi) corresponds to the minimization of a penalised loss function based on the hinge loss.
What is the penalised loss function, or the hinge loss function
The loss function comparing the observed class labels with the classifier output f(xi). Correctly classified points (beyond the margin) incur zero or small loss, while incorrectly classified instances incur a large loss.
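A hedged sketch of the formula referred to above, in standard textbook form (λ is a tuning parameter playing the role of the cost):

\[
\min_{\beta_0,\,\beta} \ \left\{ \sum_{i=1}^{n} \max\!\left[\,0,\ 1 - y_i f(x_i)\,\right] \;+\; \lambda \sum_{j=1}^{p} \beta_j^{2} \right\},
\]

where max[0, 1 − yi f(xi)] is the hinge loss for observation i.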
Why is it important to distinguish SVM and logistic regression - differing results?
For an SVM with a non-linear kernel function, SVM and logistic regression tend to give different results! The SVM will in general give better predictive performance than the logistic model.
What are the drawbacks of SVM
SVM is very hard to interpret and its only use is prediction/classification: it can't be used for anything else.
The predictive performance of SVM can be very sensitive to the choice of the kernel and the cost C.
Differently from logistic regression, SVM does not return estimated class probabilities.
What if we wish to compute AU-ROC and AU-PR? Or consider different probability thresholds for classification? (The Platt’s method)
Explain overfitting and underfitting in a SVM classifier
A too flexible SVM classifier will work well on the training data, but it may have reduced predictive performance on new observations (overfitting).
A not flexible enough SVM classifier will extract reduced information from training data, deteriorating the predictive performance on new observations (underfitting).
Flexibility depends on the choice of the kernel and the Cost C
What can be employed to select the best SVM
Cross-validation can be employed to select among a collection of different SVMs characterised by different kernel functions and different hyperparameters.
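A minimal scikit-learn sketch of such a selection; the candidate grid of kernels and hyperparameter values is an illustrative assumption:

# Cross-validation over kernel functions and hyperparameters for an SVM (sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    {"kernel": ["poly"], "C": [0.1, 1, 10], "degree": [2, 3]},
]
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)  # best kernel/hyperparameters by CV accuracy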
What approach is used for multiclass classification?
One-vs-one approach (OVO)
What is Platt’s method
Platt’s method allows for the estimation of the posterior class probabilities Pr(yi = 1|xi) from the output of the SVM.
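In scikit-learn, Platt-style probability estimates can be requested when fitting an SVC; a minimal sketch (the dataset is an illustrative assumption):

# probability=True fits a sigmoid on the SVM scores (via internal cross-validation)
# to estimate Pr(y = 1 | x), so AU-ROC and custom thresholds become available.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, probability=True).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  # estimated Pr(y = 1 | x)
print(roc_auc_score(y_test, probs))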
Describe the advantages of the kernel trick and how it is employed to implement support vector machines
Define the kernel trick.
The kernel function is a generalisation of the inner product, K(xi, xh) = φ(xi)^T φ(xh), where φ is a mapping function of the data. The mapping takes the data into a higher-dimensional space, which kernels handle efficiently.
Advantages:
The mapping doesn't have to be computed explicitly
Computations with the mapped features remain efficient
Less computational cost