Support Vector Machines Flashcards
How do you decide which resultant classification is best? Why does this occur?
There are multiple possible separating boundaries, each corresponding to a local minimum, but one is often better than the others because it:
> Correctly classifies all points
> Is as far away from all points as possible
[Picture 9]
What are the classes that we use in support vector machines?
-1 and 1
How can we express that a point is correctly classified (using an equation)?
t(w^Tx + w_0) ≥ 0
What happens when t(w^Tx + w_0) = 0 ?
The point lies exactly on the decision boundary, so the classifier is undecided. This situation should be avoided
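A minimal numpy sketch of this check (the weights, bias, and points below are made-up values for illustration):

```python
import numpy as np

w = np.array([1.0, -1.0])               # hypothetical weights
w0 = 0.5                                # hypothetical bias
X = np.array([[2.0, 0.0], [0.0, 2.0]])  # two points
t = np.array([1, -1])                   # their classes, in {-1, +1}

# t * (w^T x + w_0) > 0 means correctly classified;
# a score of exactly 0 means the point sits on the boundary
scores = t * (X @ w + w0)
print(scores)       # [2.5 1.5]
print(scores > 0)   # [ True  True ]
```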
How can we express that all points are correctly classified and none are on the decision boundary? How can this equation be transformed?
t(w^Tx + w_0) ≥ ε
ε > 0
We can divide both sides by ε, which rescales the weights, and set the right-hand side to any positive constant we want (e.g. 1):
t(w^Tx + w_0) ≥ 1
Why can the equations t(w^Tx + w_0) ≥ ε, ε > 0 become t(w^Tx + w_0) ≥ p, with p being any positive number?
Because we can rescale the weights w and the bias w_0 by any positive factor without changing the decision boundary (see the sketch below)
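A small numpy sketch of the rescaling argument (all numbers invented): if every point satisfies the constraint with some ε > 0, dividing w and w_0 by ε gives the same hyperplane satisfying the constraint with 1.

```python
import numpy as np

w, w0, eps = np.array([1.0, -1.0]), 0.5, 0.25   # hypothetical values
X = np.array([[2.0, 0.0], [0.0, 2.0]])
t = np.array([1, -1])

assert np.all(t * (X @ w + w0) >= eps)   # constraint with epsilon

w2, w02 = w / eps, w0 / eps              # same boundary, rescaled weights
assert np.all(t * (X @ w2 + w02) >= 1)   # constraint with 1
```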
What is the margin?
The distance between the closest point to the separation boundary and the boundary itself
[Picture 10]
What do we want from the margin?
We want to maximise it to get the best classification
How do we calculate the distance of points to the hyperplane? How do we apply this to the margin?
x = x_p + d × w/||w|| (x_p is the projection of x onto the hyperplane, d is the signed distance)
Substituting into the hyperplane equation gives w^Tx + w_0 = d||w||, so d = (w^Tx + w_0)/||w||
For a correctly classified point, |d| = t(w^Tx + w_0)/||w||; the margin is the smallest |d| over all points
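A sketch of the distance computation in numpy (weights and points invented for illustration): the signed distance of each point to the hyperplane is (w^Tx + w_0)/||w||, and the margin is the smallest absolute distance.

```python
import numpy as np

w = np.array([1.0, -1.0])   # hypothetical weights
w0 = 0.5                    # hypothetical bias
X = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 0.0]])

# signed distance d of each point to the hyperplane w^T x + w_0 = 0
d = (X @ w + w0) / np.linalg.norm(w)

# the margin is the distance of the closest point
margin = np.min(np.abs(d))
```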
Mathematically, how do we maximise the margin?
By minimising ||w||
With the constraint t(w^Tx + w_0) ≥ 1, the closest point has |d| = 1/||w||, so maximising the margin means minimising ||w||
What is the support vector formulation?
Minimise: 0.5||w||^2
(making the margin as large as possible)
Subject to the constraints: t_i(w^Tx_i + w_0) ≥ 1
(Every point is on the correct side, no point is on the hyperplane, and this must hold for every single point)
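This formulation can be written down almost verbatim as a small quadratic program. A sketch using cvxpy (one possible solver choice, not prescribed by the formulation; the toy data is invented):

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data; labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
w0 = cp.Variable()

# Minimise 0.5 ||w||^2 subject to t_i (w^T x_i + w_0) >= 1 for every i
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w)),
    [cp.multiply(t, X @ w + w0) >= 1],
)
problem.solve()
print(w.value, w0.value)
```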
Which points matter for SVM?
Only the points closest to the boundary: the support vectors
What is a data set limitation of the support vector formulation? Why is this a problem? What can be done?
The formulation only has a solution if the data set is exactly linearly separable, so noise in the data set is a problem. We can allow exceptions, but we want as few as possible
What is it called when you allow some data points to be wrongly classified?
Slack
What is the equation for including slack in SVM? What is the additional constraint?
t_i (w^Tx_i + w_0 ) ≥ 1 - ξ_i
ξ_i ≥ 0
What happens to the value of ξ_i on different sides of the margin? What is it a measure of?
Correct side: ξ_i = 0
Incorrect side: ξ_i > 0
It is a measure of how much we are violating the constraint
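As a sketch with made-up weights and points, the slack of each point can be read off as ξ_i = max(0, 1 − t_i(w^Tx_i + w_0)):

```python
import numpy as np

w, w0 = np.array([1.0, -1.0]), 0.5   # hypothetical weights
X = np.array([[2.0, 0.0], [0.3, 0.0], [-1.0, 0.0]])
t = np.array([1, 1, 1])

# xi = 0 where the constraint holds, > 0 where it is violated
xi = np.maximum(0.0, 1.0 - t * (X @ w + w0))
print(xi)   # [0.  0.2 1.5]
```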
What is the equation that we want to minimise to reduce the number of misclassifications (including the weight of violations)?
Minimise: 0.5||w||^2 + C × ∑ξ_i
C = weight of the violations
> C = 0 means we are ignoring all misclassified points
> C = ∞ means that we have a hard margin and do not allow any misclassified points
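As a sketch of how C behaves in practice, using scikit-learn's SVC (assuming it is available; the toy data and the noisy label are invented):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data where the last point is labelled "wrongly" (noise)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0], [1.5, 1.5]])
t = np.array([1, 1, -1, -1, -1])

soft = SVC(kernel="linear", C=0.1).fit(X, t)   # cheap violations: tolerates misclassified points
hard = SVC(kernel="linear", C=1e6).fit(X, t)   # approximates a hard margin
```

(C = ∞ cannot be passed literally; a very large C is the usual approximation of a hard margin.)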
What happens if the points are not linearly separable in their normal dimensions?
We can increase the number of dimensions so that they are linearly separable
How do we expand the number of dimensions?
We choose a basis and expand each point into it. The equations remain the same, just applied to the expanded vectors (see the sketch below)
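A classic sketch of this (invented 1-D data): points whose class sits in the middle of the line cannot be separated by a single threshold, but expanding each point x into the basis (x, x²) makes them linearly separable.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
t = np.array([1, -1, -1, -1, 1])   # no single threshold on x separates the classes

# Basis expansion phi(x) = (x, x^2): the classes are now split by
# the line x^2 = constant, which is linear in the expanded space
Phi = np.column_stack([x, x ** 2])
```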