Support Vector Machines Flashcards
How do you decide which resultant classification is best? Why does this occur?
There are multiple possible separating boundaries, each corresponding to a local minimum, but one is often better than the others because it:
> Correctly classifies all points
> Is as far away from all points as possible
[Picture 9]
What are the classes that we use in support vector machines?
-1 and 1
How can we express that a point is correctly classified (using an equation)?
t(w^Tx + w_0) ≥ 0
What happens when t(w^Tx + w_0) = 0 ?
The point lies exactly on the decision boundary, so the classifier is undecided. This situation should be avoided
How can we express that all points are correctly classified and none are on the decision boundary? How can this equation be transformed?
t(w^Tx + w_0) ≥ ε
ε > 0
We can divide both sides by ε, which rescales the right-hand side to any positive constant we like (for example, 1)
t(w^Tx + w_0) ≥ 1
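A quick numeric check of the rescaling idea (a sketch with made-up toy values): dividing w and w_0 by ε scales the constraint value up to ≥ 1 but leaves the sign, and therefore the classification, unchanged.

```python
import numpy as np

# Toy values (made up for illustration)
w = np.array([2.0, -1.0])
w0 = 0.5
x = np.array([1.0, 0.0])
t = 1          # class label
eps = 0.5      # original constraint: t(w^T x + w_0) >= eps

margin_before = t * (w @ x + w0)      # value before rescaling
w_s, w0_s = w / eps, w0 / eps         # rescale the weights by 1/eps
margin_after = t * (w_s @ x + w0_s)   # now satisfies >= 1 form

# The decision boundary is unchanged: the sign is the same
assert np.sign(w @ x + w0) == np.sign(w_s @ x + w0_s)
```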
Why can the equations: t(w^Tx + w_0) ≥ ε ε > 0 become: t(w^Tx + w_0) ≥ p with p being any positive number?
Because we can rescale the weights (divide w and w_0 by ε) to make the right-hand side whatever we want
What is the margin?
The distance between the closest point to the separation boundary and the boundary itself
[Picture 10]
What do we want from the margin?
We want to maximise it to get the best classification
How do we calculate the distance of points to the hyperplane? How do we apply this to the margin?
x = x_p + d × w/||w||   (x_p is the projection of x onto the hyperplane)
w^Tx + w_0 = d||w||   (since w^Tx_p + w_0 = 0)
|d| = t(w^Tx + w_0) / ||w||   (for a correctly classified point)
The margin is the smallest |d| over all points
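The distance formula above is easy to check numerically. A sketch with a made-up hyperplane and toy points: the signed distance is (w^T x + w_0)/||w||, and the margin is the smallest t·d over the data set.

```python
import numpy as np

# Toy hyperplane: w^T x + w_0 = 0 with ||w|| = 5 (values made up)
w = np.array([3.0, 4.0])
w0 = -5.0
X = np.array([[3.0, 4.0], [2.0, 1.0], [0.0, 0.0]])
t = np.array([1, 1, -1])

# Signed distance of each point to the hyperplane
d = (X @ w + w0) / np.linalg.norm(w)

# t * d > 0 for correctly classified points; the margin is the minimum
margin = np.min(t * d)
```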
Mathematically, how do we maximise the margin?
By minimising ||w||
|d| ≥ 1/||w||, with equality for the closest points, so the margin is 1/||w||: a smaller ||w|| gives a larger margin
What is the support vector formulation?
Minimise: 0.5||w||^2
(making the margin as large as possible)
Subject to the constraints: t_i(w^Tx_i + w_0) ≥ 1
(every point is on the correct side and no point is on the hyperplane; this must hold for every single point)
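The formulation above is a small quadratic programme. A hedged sketch, solving it with SciPy's general-purpose SLSQP optimiser rather than a dedicated QP solver, on a made-up linearly separable toy data set:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
t = np.array([1, 1, -1, -1])

def objective(v):
    # v packs [w_1, w_2, w_0]; minimise 0.5 ||w||^2
    w = v[:2]
    return 0.5 * w @ w

# One inequality constraint per point: t_i (w^T x_i + w_0) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda v, i=i: t[i] * (v[:2] @ X[i] + v[2]) - 1.0}
               for i in range(len(X))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, w0 = res.x[:2], res.x[2]
# Every constraint t_i(w^T x_i + w_0) >= 1 now holds (up to tolerance)
```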
Which points matter for SVM?
Only the points closest to the boundary (the support vectors); removing any other point does not change the solution
What is a data set limitation of the support vector formulation? Why is this a problem? What can be done?
The formulation only has a solution if the data set is exactly linearly separable; with noise in the data set, no separating hyperplane may exist at all. We can allow exceptions, but we want as few as possible
What is it called when you allow some data points to be wrongly classified?
Slack
What is the equation for including slack in SVM? What is the additional constraint?
t_i (w^Tx_i + w_0 ) ≥ 1 - ξ_i
ξ_i ≥ 0
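The smallest slack satisfying both constraints is ξ_i = max(0, 1 − t_i(w^T x_i + w_0)), the hinge loss. A sketch with made-up numbers: points well inside the margin get ξ_i = 0, and a misclassified point gets ξ_i > 1.

```python
import numpy as np

# Toy hyperplane and points (made up for illustration)
w = np.array([1.0, 1.0])
w0 = -2.0
X = np.array([[2.0, 2.0], [0.0, 0.0], [1.5, 1.0]])
t = np.array([1, -1, -1])   # the last point is on the wrong side

scores = t * (X @ w + w0)
# Smallest slack satisfying t_i(w^T x_i + w_0) >= 1 - xi_i and xi_i >= 0
xi = np.maximum(0.0, 1.0 - scores)
```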