Discriminant-based classifiers Flashcards

1
Q

What are the decision boundary (DB) and decision surface (DS) of a perceptron?

A
  • DB: y(x) = w.T * x + w0 = 0 (w is the weight vector and w0 is the bias)
  • DS: orthogonal to w (A and B on DS <=> w.T * (A - B) = 0)
2
Q

What point on a perceptron’s decision surface is closest to the origin? What is its distance to the origin?

A

x’ = α*w with α = - w0 / |w|²
|x’| = - w0 / |w|

3
Q

What is the distance of a point to the decision boundary?

A

x = x⊥ + γ * w / |w| with γ = y(x) / |w|, where x⊥ is the orthogonal projection of x onto the boundary and γ is the signed distance of x to it
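
A minimal NumPy sketch of this decomposition; the weight vector, bias and point below are made-up values for illustration:

import numpy as np

# hypothetical linear discriminant y(x) = w.T @ x + w0
w = np.array([3.0, 4.0])   # weight vector
w0 = -5.0                  # bias

def signed_distance(x):
    # gamma = y(x) / |w|, the signed distance of x to the boundary y(x) = 0
    return (w @ x + w0) / np.linalg.norm(w)

x = np.array([2.0, 1.0])
gamma = signed_distance(x)                   # y(x) = 3*2 + 4*1 - 5 = 5, |w| = 5, so gamma = 1
x_perp = x - gamma * w / np.linalg.norm(w)   # orthogonal projection of x onto the boundary
assert np.isclose(w @ x_perp + w0, 0.0)      # x_perp lies on the decision boundary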

4
Q

What is the perceptron algorithm?

A

-initialize w randomly
-repeat until convergence (i.e. all datapoints are correctly classified):
{
| for each datapoint x(n), t(n):
| {
| | if y(x(n)) == t(n): do nothing
| | else: w = w + x(n)t(n)
| }
}
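
A runnable NumPy version of this loop, as a sketch: it assumes targets t(n) in {-1, +1}, that the prediction is sign(w.T x), and it folds the bias into w; max_epochs is an added safety guard, since the loop only terminates on its own for linearly separable data.

import numpy as np

def perceptron(X, t, max_epochs=1000, seed=0):
    # X has shape (N, D); t holds targets in {-1, +1}
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a 1 so the bias w0 lives inside w
    w = rng.normal(size=Xb.shape[1])                # initialize w randomly
    for _ in range(max_epochs):
        errors = 0
        for x_n, t_n in zip(Xb, t):
            if np.sign(w @ x_n) != t_n:             # misclassified data point
                w = w + t_n * x_n                   # update: w = w + x(n) * t(n)
                errors += 1
        if errors == 0:                             # convergence: all points correctly classified
            break
    return w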

5
Q

What are the issues of the perceptron?

A
  • can’t converge if data is not linearly separable
  • no unique solution (in linearly separable cases)
  • convergence is slow when classes are close to each other
6
Q

How to choose between the primal and the dual formulations of the SVM problem?

A

Primal: optimization over the M model parameters, O(M³)
Dual: optimization over the N training examples, O(N²)
So prefer the primal when M < N, and the dual when N < M or when a (non-linear) kernel is needed.

7
Q

What are Karush-Kuhn-Tucker (KKT) conditions?

A
  • a_n ≥ 0
  • t(n) * y(x(n)) - 1 ≥ 0
  • a_n * (t(n) * y(x(n)) - 1) = 0
    i.e. : for every datapoint n, either a_n = 0 or t(n) * y(x(n)) = 1
8
Q

How is the prediction of an SVM calculated?

A

y(x) = Σ(n in support vector set) a_n * t(n) * K(x(n), x) + b, where K is the kernel and b is the bias
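
A small sketch of this sum, assuming the support vectors x(n), their targets t(n), the multipliers a_n and the bias b are already known from training (all values below are illustrative):

import numpy as np

def svm_predict(x, X_sv, t_sv, a_sv, b, kernel):
    # y(x) = sum over support vectors of a_n * t(n) * K(x(n), x), plus b
    return sum(a_n * t_n * kernel(x_n, x)
               for x_n, t_n, a_n in zip(X_sv, t_sv, a_sv)) + b

linear = lambda u, v: u @ v                   # linear kernel as an example
X_sv = np.array([[1.0, 2.0], [-1.0, -1.0]])   # support vectors
t_sv = np.array([1.0, -1.0])                  # their targets
a_sv = np.array([0.5, 0.5])                   # Lagrange multipliers a_n
y = svm_predict(np.array([0.5, 0.5]), X_sv, t_sv, a_sv, b=0.1, kernel=linear)
# the predicted class is sign(y)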

9
Q

What is the Gaussian kernel formula?

A

K(x, x’) = exp( - |x - x’|² / (2σ²) )
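
A direct NumPy translation; sigma is a free hyperparameter:

import numpy as np

def gaussian_kernel(x, x_prime, sigma=1.0):
    # K(x, x') = exp(-|x - x'|^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - x_prime) ** 2) / (2.0 * sigma ** 2))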

10
Q

What are hard margin SVMs and their pros and cons?

A

They are SVMs with an infinite penalty for misclassified data points (no margin violation is allowed).

Pros:
- the max-margin solution is unique (unlike the perceptron's)

Cons:
- can only be used on linearly separable data

11
Q

What are soft margin SVMs and their pros and cons?

A

They are SVMs that use slack variables.

Pros:
- can be used on non-linearly separable data
- can be adapted using the cost parameter C (hard margin SVMs are soft margin SVMs with an infinite C, i.e. no error is allowed)

12
Q

What are slack variables and the associated error function?

A

It’s a set of variables (ξn). For each example n:
- ξn = 0 if n is correctly classified and on or outside the correct margin boundary
- 0 < ξn ≤ 1 if n is correctly classified but inside the margin boundary
- ξn > 1 if n is misclassified

The new error function is C * Σ ξn + |w|² / 2, where C is the cost parameter, which controls how much misclassified points penalize the model.

13
Q

What is the SVM objective function?

A

Σ E_SV( y(x(n)) * t(n) ) + |w|² / (2C),
where E_SV is the hinge loss: E_SV(z) = max(0, 1 - z)
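
A small NumPy sketch of this objective for given outputs y(x(n)), targets t(n) and weights w; note that E_SV(y(x(n)) * t(n)) is exactly the slack variable ξn of the previous cards:

import numpy as np

def hinge(z):
    # E_SV(z) = max(0, 1 - z), applied elementwise
    return np.maximum(0.0, 1.0 - z)

def svm_objective(y, t, w, C):
    # sum of hinge losses on y(x(n)) * t(n), plus the regularizer |w|^2 / (2C)
    return np.sum(hinge(y * t)) + (w @ w) / (2.0 * C)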

14
Q

What are the differences between parametric and non-parametric methods?

A

parametric vs non-parametric:
- finite vs (potentially) infinite number of parameters
- the number of parameters is independent of vs grows with the dataset size

Examples:
- GDA, logistic regression, primal SVM are parametric methods
- k-NN, dual SVM are non-parametric methods

15
Q

What are precision, recall and F1 score?

A

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2*P*R / (P + R)
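
A direct translation from the TP/FP/FN counts (a minimal sketch with no handling of zero denominators):

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_f1(8, 2, 4)   # p = 0.8, r ≈ 0.667, f1 ≈ 0.727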

16
Q

What’s Hunt’s algorithm for decision trees?

A

Apply recursively on each (sub)dataset Dt:
- If all Dt records belong to the same class, then ‘t’ is a leaf node labelled with this class
- If Dt is empty, then ‘t’ is a leaf node labelled with the default class
- If Dt contains records of different classes, use an attribute test to split Dt
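
A compact, self-contained Python sketch of this recursion; the record format, the attribute-value partition and the Gini-based choose_split below are illustrative assumptions, not a fixed specification:

from collections import Counter

def gini(labels):
    # Gini(t) = 1 - sum_j P(j|t)^2 over the class labels at a node
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def choose_split(records, attributes):
    # pick the attribute whose value-partition has the lowest weighted Gini index
    def weighted_gini(attr):
        groups = {}
        for attrs, label in records:
            groups.setdefault(attrs[attr], []).append(label)
        n = len(records)
        return sum(len(g) / n * gini(g) for g in groups.values())
    return min(attributes, key=weighted_gini)

def hunt(records, attributes, default_class=None):
    # records is a list of (attributes_dict, class_label) pairs forming Dt
    if not records:                               # empty Dt -> leaf labelled with the default class
        return default_class
    labels = [label for _, label in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:   # pure node (or nothing left to split on) -> leaf
        return majority
    attr = choose_split(records, attributes)      # attribute test used to split Dt
    rest = [a for a in attributes if a != attr]
    subsets = {}
    for attrs, label in records:
        subsets.setdefault(attrs[attr], []).append((attrs, label))
    return (attr, {v: hunt(sub, rest, majority) for v, sub in subsets.items()})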

17
Q

How to determine the best split for a Decision Tree?

A

The best split is the one that yields the most homogeneous child nodes (i.e. each child node has a clear majority class), as measured by an impurity measure such as the Gini index or entropy.

18
Q

What is the Gini index?

A

For a node t:
Gini(t) = 1 - Σ_j P(j|t)²
For a split, it’s the weighted average of the Gini index of each child node, i.e. Σ_i (n_i / n) * Gini(i), where n_i is the number of records at child node i and n the total number of records.
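
A short worked example in Python, with made-up class counts:

def gini(counts):
    # Gini = 1 - sum_j P(j)^2, from the class counts at a node
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = gini([6, 4])                        # 1 - (0.6² + 0.4²) = 0.48
children = [[5, 1], [1, 3]]                  # class counts in the two child nodes
split = sum(sum(c) / 10 * gini(c) for c in children)
# = 6/10 * 0.278 + 4/10 * 0.375 ≈ 0.317 : lower than the parent, so the split helps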

19
Q

What is the entropy of a split?

A

For a node t: Entropy(t) = - Σ_j P(j|t) * log[P(j|t)]
For a split, as with the Gini index, take the weighted average over the child nodes: Σ_i (n_i / n) * Entropy(i)
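
A small NumPy sketch computing the entropy of a node and of a split from class counts (the counts are made up):

import numpy as np

def entropy(counts):
    # Entropy = -sum_j P(j) * log2(P(j)), from the class counts at a node
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()                  # drop empty classes to avoid log(0)
    return -np.sum(p * np.log2(p))

node = entropy([5, 5])                                     # 1.0 bit: maximally impure node
split = 6/10 * entropy([5, 1]) + 4/10 * entropy([0, 4])    # weighted entropy of a split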

20
Q

What is the Gain ratio?

A

Gain ratio = Gain / SplitInfo, where SplitInfo = - Σ_i (n_i / n) * log(n_i / n) is the entropy of the split proportions n_i / n (this penalizes splits into many small partitions).
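
A small self-contained sketch with made-up class counts; the entropy helper is the same as on the previous card:

import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

parent, children = [6, 4], [[5, 1], [1, 3]]        # class counts before and after the split
sizes = [sum(c) for c in children]                 # n_i for each child node
n = sum(sizes)
gain = entropy(parent) - sum(s / n * entropy(c) for s, c in zip(sizes, children))
split_info = entropy(sizes)                        # entropy of the proportions n_i / n
gain_ratio = gain / split_info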

21
Q

What is (rule) post-pruning?

A

It’s letting a decision tree overfit and then pruning it (as opposed to stopping the tree’s growth early).
Rule post-pruning converts the decision tree to a set of decision rules before pruning them. After rule post-pruning, the original tree cannot be recovered.