Week 8 Flashcards
What is kNN?
Supervised classification.
Assumes similar data points exist close to each other (similarity is captured by a distance metric, e.g. Euclidean).
Given labelled data, the class of a new point is determined by the majority class of its k nearest neighbours (k is a hyperparameter).
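A minimal sketch of the idea (assuming scikit-learn, which these notes don't name; the toy points are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 1.2], [0.8, 0.9], [3.1, 3.0], [3.3, 2.8]])
y_train = np.array([0, 0, 1, 1])  # labelled data, two classes

# k (n_neighbors) is the hyperparameter; the default metric is Euclidean distance
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# A new point takes the majority class of its 3 nearest neighbours
print(knn.predict([[1.0, 1.0]]))  # two of the three nearest points are class 0
```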
kNN disadvantages
Susceptible to the influence of outliers (especially where one class overlaps with another)
Susceptible to class imbalance (with a high k, predictions are biased towards the dominant class)
How do we choose k?
Start with k=1, predict on the test set and evaluate; repeat while increasing k and keep the best-performing value. k should be odd to avoid tied votes (in binary classification).
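A hedged sketch of this selection loop (assuming scikit-learn and a held-out set; the variable names are illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def choose_k(X_train, y_train, X_test, y_test, max_k=15):
    scores = {}
    for k in range(1, max_k + 1, 2):  # odd values of k only
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        scores[k] = accuracy_score(y_test, knn.predict(X_test))
    return max(scores, key=scores.get), scores  # best k and the full score table
```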
What is weighted kNN?
Nearer neighbours should have more influence on the query point than those further away. Weight each neighbour by 1/distance and add up the weights for each class; the class with the largest total wins.
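A minimal sketch of that 1/distance weighting (the data here is made up; scikit-learn exposes the same idea via KNeighborsClassifier(weights="distance")):

```python
import numpy as np

X_train = np.array([[0.0, 0.0], [0.5, 0.5], [2.0, 2.0]])
y_train = np.array([0, 0, 1])
query = np.array([1.2, 1.2])

dists = np.linalg.norm(X_train - query, axis=1)   # Euclidean distances
weights = 1.0 / dists                             # nearer neighbours weigh more
scores = {c: weights[y_train == c].sum() for c in np.unique(y_train)}
print(max(scores, key=scores.get))                # class with the largest summed weight
```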
kNN vs other algorithms
kNN performs instance-based (lazy) learning, so it suffers performance degradation with a large training set.
It is suitable for data with few features, which keeps the distance computations cheap (feature selection should be performed first).
The data must be normalised because distance metrics are used.
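A sketch of scaling before kNN (assuming scikit-learn; a pipeline keeps the scaling learned on the training data and reapplies it at prediction time):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Normalise features first so no single feature dominates the distance metric
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# model.fit(X_train, y_train); model.predict(X_new)
```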
What is a Support Vector Machine?
A supervised learning algorithm used for binary and multiclass classification problems.
Performs particularly well on text, with high speed and good accuracy; commonly used to classify text and gene expression data.
Once trained on labelled data, an SVM can categorise new, unlabelled data.
How do SVMs work?
Find a line (more generally, a hyperplane) that separates the data points. The shortest distance between the observations and this threshold is called the margin. New points are classified according to which side of the line they fall on.
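A minimal sketch of fitting a linear SVM (assuming scikit-learn; the toy data is made up):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 0.5], [4.0, 4.0], [4.5, 3.5]])
y = np.array([0, 0, 1, 1])

svm = SVC(kernel="linear")   # finds the separating line with the largest margin
svm.fit(X, y)
print(svm.predict([[2.0, 1.5], [4.2, 4.1]]))  # new points classified by which side they fall on
```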
What happens if we choose a threshold that allows for misclassification?
Worse fit on the training data, but often better at classifying new data: lower variance, higher bias.
What are the margins called?
Soft margin when misclassifications are allowed
Hard margin when they are not.
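In practice this trade-off is usually exposed as a regularisation parameter; a hedged sketch assuming scikit-learn's SVC, where C penalises misclassifications:

```python
from sklearn.svm import SVC

soft = SVC(kernel="linear", C=0.1)   # small C: softer margin, more misclassification tolerated
hard = SVC(kernel="linear", C=1e6)   # very large C: approximates a hard margin
```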
What is a hyperplane?
A hyperplane in an n-dimensional Euclidean space is a flat, (n − 1)-dimensional subset of that space that divides the space into two disconnected parts.
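In coordinates, a hyperplane can be written as the solution set of a single linear equation, {x in R^n : w · x + b = 0}, for a nonzero normal vector w and offset b; an SVM decision boundary has exactly this form.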
Why are certain data points in SVMs called support vectors?
They are the data points that support, i.e. determine, the decision boundary. We want to maximise the margin to these points (an optimisation problem).
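A short illustration (assuming scikit-learn; toy data made up): after fitting, the model keeps exactly these boundary-defining points.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 0.5], [4.0, 4.0], [4.5, 3.5]])
y = np.array([0, 0, 1, 1])
svm = SVC(kernel="linear").fit(X, y)

print(svm.support_vectors_)  # the points that determine the maximum-margin boundary
```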
How do SVMs separate data that is not linearly separable by a line?
Apply a transformation such as φ(x) = x² and add a second dimension to the feature space; the data then becomes linearly separable in that higher-dimensional space.
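A hedged sketch of that feature map on 1-D data (assuming scikit-learn; the data is made up):

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])          # no single threshold on x separates the classes

X_mapped = np.column_stack([x, x ** 2])      # add the second dimension x^2
svm = SVC(kernel="linear").fit(X_mapped, y)  # a straight line now separates them

queries = np.array([1.0, -2.5])
print(svm.predict(np.column_stack([queries, queries ** 2])))
```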
What is the kernel trick for?
It computes the high-dimensional relationships (the dot products in the mapped space) without actually transforming the data, reducing the computation required for SVMs by avoiding the explicit mapping.
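A minimal sketch with an RBF kernel (one common choice, assuming scikit-learn): a non-linear separation like the one above is found without ever building the mapped features explicitly.

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([[-3.0], [-2.0], [-0.5], [0.0], [0.5], [2.0], [3.0]])
y = np.array([1, 1, 0, 0, 0, 1, 1])

svm = SVC(kernel="rbf").fit(x, y)   # the kernel computes similarities; no explicit phi(x) is formed
print(svm.predict([[0.2], [2.5]]))
```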