B07 Support Vector Machines Flashcards
What are Support Vector Machines?
An approach that represents data as points in a multi-dimensional space, arranged in such a way that points with different labels are divided by a clear gap.
What is a hyperplane?
Given a set of data points that each belong to one of two classes, we can draw a line (a hyperplane) that separates the data into two partitions based on the label. *Applies to linearly separable data
What is the Maximum Margin Hyperplane? (MMH)
With linearly separable data, because there is a non-zero distance between the two closest points of opposite labels, there are an infinite number of potential separating hyperplanes.
-The goal is to identify the hyperplane that creates the greatest separation between the classes: this hyperplane is the MMH.
What are support vectors?
- The points from each class which are closest to the MMH are known as the support vectors.
- Each class must have one or more support vectors.
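A hedged sketch of how support vectors surface in practice (assuming scikit-learn is available; the toy data below is invented for illustration):

import numpy as np
from sklearn.svm import SVC

# Two small linearly separable clusters (arbitrary example data)
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)  # the points from each class closest to the MMH
print(clf.n_support_)        # per-class counts: each class has one or more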
What is Quadratic Optimization?
Linearly Separable Data
-A technique for finding the MMH.
-This approach finds the perpendicular bisector of the shortest line connecting the outer boundaries (convex hulls) of the two classes.
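As a minimal sketch of the result (scikit-learn assumed; the data and the large C value are arbitrary choices approximating a hard margin), the fitted model exposes the MMH's weights, from which the margin can be read off:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]       # MMH: w . x + b = 0
print(w, b)
print(2.0 / np.linalg.norm(w))               # width of the maximum margin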
Other methods for finding the MMH?
-An alternative technique involves a search through the space of
every possible hyperplane in order to identify the MMH.
-Each hyperplane is defined as:
  w⃗ ⋅ x⃗ + b = 0
-The goal is to find a set of weights that specify two hyperplanes:
  w⃗ ⋅ x⃗ + b ≥ +1
  w⃗ ⋅ x⃗ + b ≤ −1
-Using vector geometry, the distance between the two hyperplanes is defined as:
  2 / ∥w⃗∥
-To maximize the distance between the hyperplanes, we need to minimize the value of ∥w⃗∥. This is expressed as minimizing (1/2)∥w⃗∥² subject to yᵢ(w⃗ ⋅ x⃗ᵢ + b) ≥ 1 for every training point (x⃗ᵢ, yᵢ).
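To make the formulation concrete, here is a hedged sketch that hands this constrained minimization to SciPy's general-purpose solver (a toy stand-in for a dedicated QP solver; the data is invented):

import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

# Parameters: params = [w1, w2, b]; minimize (1/2)||w||^2
objective = lambda params: 0.5 * params[:2] @ params[:2]

# One constraint per point: y_i (w . x_i + b) - 1 >= 0
constraints = [
    {"type": "ineq", "fun": lambda p, i=i: y[i] * (p[:2] @ X[i] + p[2]) - 1.0}
    for i in range(len(X))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b, 2.0 / np.linalg.norm(w))  # hyperplane weights and the resulting margin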
Non-linearly separable data
What is Soft-Margin classification?
-A simple approach to dealing with non-linearly separable data
is to allow a small number of points that are close to the
boundary to be misclassified.
In the context of soft margin classification, what is C?
-The number of allowed misclassifications is governed by a user-defined parameter C, which is called the cost.
-The higher the value of C, the more heavily each misclassification is penalized, and so the less likely it is that the algorithm will misclassify a point.
- With the introduction of a slack variable (ξ, "xi") and a cost (C) to the model, instead of finding the maximum margin we focus on finding the minimum total cost.
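A brief sketch of C's effect (scikit-learn assumed; the overlapping clusters are randomly generated for illustration):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping clusters, so the data is not linearly separable
X = np.vstack([rng.normal(0.0, 1.5, (50, 2)), rng.normal(2.0, 1.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Higher C penalizes slack more, so fewer margin violations are tolerated
    print(C, int(clf.n_support_.sum()), clf.score(X, y))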
What is the Kernel trick?
-Some real-life patterns cannot be dealt with simply by using soft-margin classifiers.
-Patterns which need multiple and/or non-linear boundaries are
dealt with using an approach known as the kernel trick.
What is a Kernel?
-A kernel is a function that computes the dot product between two vectors in a transformed space.
-Given two vectors x⃗ᵢ and x⃗ⱼ, the kernel function
  K(x⃗ᵢ, x⃗ⱼ) = ϕ(x⃗ᵢ) ⋅ ϕ(x⃗ⱼ)
combines them into a single number by computing their dot product.
-Phi represents the mapping of our vectors to a new space.
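A small sketch verifying this equivalence by hand for the degree-2 polynomial kernel, whose mapping ϕ is known in closed form (pure NumPy; the vectors are arbitrary):

import numpy as np

def phi(x):
    # Explicit feature map for the kernel K(x, z) = (x . z)^2 in 2-D
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 0.5])

k_direct = (xi @ xj) ** 2     # kernel evaluated in the original space
k_mapped = phi(xi) @ phi(xj)  # dot product after mapping to the new space
print(k_direct, k_mapped)     # both print 16.0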
The idea behind the Kernel trick?
-The idea behind the kernel trick is to map the classification
problem to a space in which the problem is rendered
separable via a separation boundary that is simple in the
new space, but complex in the original one.
-The transformed space typically has higher dimensionality,
with each of the dimensions being (possibly complex)
combinations of the original problem variables.
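As an illustration (scikit-learn assumed; the dataset is generated, not from the source): concentric circles have no linear boundary in the original 2-D space, but the RBF kernel's implicit mapping makes them easy to separate:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)  # no straight line can separate the rings
rbf = SVC(kernel="rbf").fit(X, y)        # simple boundary in the transformed space

print(linear.score(X, y))  # near 0.5, i.e. barely better than guessing
print(rbf.score(X, y))     # close to 1.0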
Choosing the right Kernel
-Choosing the most appropriate kernel depends heavily on the problem at hand.
-Fine-tuning the parameters of a kernel can easily become a tedious and cumbersome task.
-This is often done iteratively (e.g. via a grid search, as sketched below); however, some automated kernel-selection tools also exist.
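One common way to automate the iteration is an exhaustive grid search; a minimal sketch with scikit-learn's GridSearchCV (the dataset and parameter grid are arbitrary assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {
    "kernel": ["linear", "poly", "rbf"],
    "C": [0.1, 1, 10],
}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_)  # the kernel/cost pair that cross-validated best
print(search.best_score_)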
Some common Kernel Functions:
-Linear: K(x⃗ᵢ, x⃗ⱼ) = x⃗ᵢ ⋅ x⃗ⱼ
-Polynomial of degree d: K(x⃗ᵢ, x⃗ⱼ) = (x⃗ᵢ ⋅ x⃗ⱼ + 1)ᵈ
-Radial Basis Function (Gaussian): K(x⃗ᵢ, x⃗ⱼ) = exp(−γ∥x⃗ᵢ − x⃗ⱼ∥²)
-Sigmoid: K(x⃗ᵢ, x⃗ⱼ) = tanh(κ(x⃗ᵢ ⋅ x⃗ⱼ) − δ)