KNN Flashcards
Effect of larger k on decision boundary
- Smoother decision boundary
- Reduces the risk of overfitting caused by noisy data
- Increases the risk of ignoring important patterns
Effect of smaller k on decision boundary
- More discriminating; can capture finer patterns in the data
- Makes the model more susceptible to noise and / or outliers, so there is a higher risk of overfitting
The [larger / smaller] the dataset, the less important the difference between two choices of k becomes.
Larger
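
As a quick illustration of the two cards above, a minimal sketch (assuming scikit-learn and a synthetic two-moons dataset; numbers are illustrative): very small k tends to fit the noise, very large k smooths the boundary and can start to underfit.

```python
# Sketch: effect of k on KNN fit (scikit-learn assumed; dataset is synthetic).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 25, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Very small k: high train accuracy, lower test accuracy (overfitting).
    # Very large k: boundary smooths out and may begin to underfit.
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))
```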
Strengths of KNN
- Simple / easy to interpret
- Makes no assumptions about the underlying data distribution
- Training phase is very fast
Weaknesses of KNN
- Does not produce a model
- Selection of appropriate k is often arbitrary
- Slow classification phase
- Does not handle missing values, outliers, or nominal data well without pre-processing
KNN is a [parametric / non-parametric] algorithm
Non-parametric - makes no assumptions about the underlying data distribution; as a result, the model structure is determined directly from the data
Is KNN a lazy learning algorithm?
Yes - it does not use the training data points to make any generalization during training
- There is little to no explicit training phase
- The training phase is very fast
- KNN keeps all the training data, since it is needed during the classification (testing) phase
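
A rough illustration of the lazy-learning point (scikit-learn assumed; timings are machine-dependent): fitting mostly just stores and lightly indexes the training data, while prediction does the distance work.

```python
# Sketch: KNN "training" mostly stores the data, so fit is fast and
# predict is comparatively slow (scikit-learn assumed; timings will vary).
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 20))
y = rng.integers(0, 2, size=50_000)

knn = KNeighborsClassifier(n_neighbors=5)

t0 = time.perf_counter(); knn.fit(X, y); t1 = time.perf_counter()
knn.predict(X[:1_000]); t2 = time.perf_counter()

print(f"fit:     {t1 - t0:.3f}s")   # fast: data is only stored / indexed
print(f"predict: {t2 - t1:.3f}s")   # slower: distances computed at query time
```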
KNN is a [supervised / non-supervised] algorithm
Supervised - relies on input data that is labelled to learn a function that produces an appropriate output
Common distance definitions used in KNN
Euclidean distance and Manhattan distance. Minkowski distance is a generalization of both Euclidean and Manhattan distances.
When to set Manhattan distance over Euclidean distance as the distance metric?
Manhattan distance is often preferred over Euclidean distance when the data is high-dimensional
Def Euclidean distance
Measure of the true straight-line distance between two points in Euclidean space
D_e(x, y) = [∑: i=1 to n ( x_i - y_i ) ^ 2] ^ ( 1 / 2 )
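
A minimal sketch of this formula in NumPy (the function name euclidean_distance is illustrative):

```python
# Sketch: Euclidean distance between two feature vectors (NumPy assumed).
import numpy as np

def euclidean_distance(x, y):
    x, y = np.asarray(x), np.asarray(y)
    return np.sqrt(np.sum((x - y) ** 2))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```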
Def Manhattan distance
Measure of the sum of the absolute differences of the Cartesian coordinates of two points
D_m(x, y) = ∑: i=1 to n | x_i - y_i |
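
The same kind of sketch for this formula (the function name manhattan_distance is illustrative):

```python
# Sketch: Manhattan (city-block) distance between two feature vectors (NumPy assumed).
import numpy as np

def manhattan_distance(x, y):
    x, y = np.asarray(x), np.asarray(y)
    return np.sum(np.abs(x - y))

print(manhattan_distance([0, 0], [3, 4]))  # 7
```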
Pseudocode of KNN algorithm to predict the class
- Load the classified data (training set)
- Initialise the value of k
- Load the new (unclassified) points
- For each new_point_i:
  - Calculate the distance between new_point_i and each training data point
  - Sort the calculated distances in ascending order and select the top k rows
  - Get the most frequent class among these k rows via majority vote
  - Assign that class as the predicted class for new_point_i
- Return the array of predicted classes
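
A minimal NumPy sketch of the steps above; the helper name knn_predict and the brute-force Euclidean search are illustrative choices, not from any particular library.

```python
# Sketch of the pseudocode above: brute-force KNN classification with NumPy.
# Names (knn_predict, new_points, ...) are illustrative only.
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, new_points, k=3):
    train_X, train_y = np.asarray(train_X), np.asarray(train_y)
    predictions = []
    for new_point in np.asarray(new_points):
        # Distance from the new point to every training point (Euclidean).
        distances = np.sqrt(np.sum((train_X - new_point) ** 2, axis=1))
        # Indices of the k nearest training points.
        nearest = np.argsort(distances)[:k]
        # Majority vote over the classes of the k nearest neighbours.
        predicted = Counter(train_y[nearest]).most_common(1)[0][0]
        predictions.append(predicted)
    return np.array(predictions)

# Tiny usage example with two classes.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, [[0.5, 0.5], [5.5, 5.5]], k=3))  # [0 1]
```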
Def Minkowski distance
Generalization of Manhattan (p = 1) and Euclidean (p = 2) distance
D_mink(x, y) = [∑: i=1 to n | x_i - y_i | ^ p] ^ ( 1 / p )
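
A minimal sketch showing how this formula recovers Manhattan (p = 1) and Euclidean (p = 2) distance (NumPy assumed; the function name is illustrative):

```python
# Sketch: Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean.
import numpy as np

def minkowski_distance(x, y, p=2):
    x, y = np.asarray(x), np.asarray(y)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

print(minkowski_distance([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
print(minkowski_distance([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)
```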
In KNN, a single observation can be thought of as a:
Vector in high-dimensional feature space.
A feature vector can be denoted as
x = (x_1, x_2, ..., x_D) such that x_d denotes the value of the dth feature of x
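
A tiny illustrative example (the feature names are made up for illustration) of one observation as a feature vector:

```python
# Sketch: one observation as a vector in feature space
# (feature names below are invented for illustration).
import numpy as np

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
x = np.array([5.1, 3.5, 1.4, 0.2])  # x_d is the value of the d-th feature
print(dict(zip(features, x)))
```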