5-KNN Flashcards

1
Q

What is the KNN algorithm?

A
1. Store all training data points
2. Compute the distance of the test instance to all training data points
3. Find the K closest data points
4. Compute the concept (the predicted label) from the K nearest data points, as in the sketch below
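
A minimal sketch of these four steps in plain Python; the Euclidean metric, majority voting, and the toy dataset are illustrative assumptions, not part of the card:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, test_x, k=3):
    """Predict a label for test_x from its k nearest training points."""
    # Step 2: distance from the test instance to every training point (Euclidean here)
    distances = [(math.dist(x, test_x), label) for x, label in zip(train_X, train_y)]
    # Step 3: keep the K closest points
    nearest = sorted(distances)[:k]
    # Step 4: majority vote over the labels of the K nearest points
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Step 1 amounts to simply keeping the training data around
train_X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
train_y = ["a", "a", "b", "b"]
print(knn_predict(train_X, train_y, (4.8, 5.1), k=3))  # -> "b"
```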
2
Q

How are data points represented in KNN?

A

Each point is a feature vector

3
Q

What is Hamming distance?

A

Number of positions at which two equal-length strings (or vectors) differ. E.g. it counts the differences in nominal attributes once they are one-hot encoded
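
A minimal sketch, assuming two equal-length sequences:

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    assert len(a) == len(b), "Hamming distance needs equal-length inputs"
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("karolin", "kathrin"))        # -> 3
print(hamming_distance([1, 0, 0, 1], [1, 1, 0, 0]))  # -> 2
```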

4
Q

What is simple matching distance?

A

1 - (number of matching features) / (total number of features)
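
A minimal sketch of this formula over two equal-length feature vectors:

```python
def simple_matching_distance(a, b):
    """1 minus the fraction of feature positions where a and b agree."""
    assert len(a) == len(b)
    matches = sum(x == y for x, y in zip(a, b))
    return 1 - matches / len(a)

print(simple_matching_distance([1, 0, 1, 1], [1, 1, 1, 0]))  # 2 of 4 match -> 0.5
```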

5
Q

What is Jaccard distance?

A

1 - |A ∩ B| / |A ∪ B|

If nominal features are one-hot encoded, each set is the set of elements encoded with 1
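
A minimal sketch using Python sets, with the one-hot interpretation from above:

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets."""
    union = a | b
    if not union:
        return 0.0  # convention: two empty sets are identical
    return 1 - len(a & b) / len(union)

# One-hot vectors interpreted as the sets of indices encoded with 1
x = {i for i, v in enumerate([1, 0, 1, 1]) if v == 1}  # {0, 2, 3}
y = {i for i, v in enumerate([1, 1, 0, 1]) if v == 1}  # {0, 1, 3}
print(jaccard_distance(x, y))  # 1 - |{0,3}| / |{0,1,2,3}| -> 0.5
```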

6
Q

What is Manhattan distance?

A

d(a,b) = sum_{i=1}^{N} |a_i - b_i|
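
A minimal sketch of this formula:

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(a, b))

print(manhattan_distance([1, 2, 3], [4, 0, 3]))  # |1-4| + |2-0| + |3-3| = 5
```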

7
Q

What is Euclidean distance?

A

d(a,b) = sqrt(sum_{i=1}^{N} (a_i - b_i)^2)
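
A minimal sketch of this formula:

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    assert len(a) == len(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([0, 0], [3, 4]))  # -> 5.0
```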

8
Q

What is cosine distance?

A

d(a,b) = 1 - cos(a,b)
       = 1 - sum_{i=1}^{N} a_i b_i / (sqrt(sum_{i=1}^{N} a_i^2) * sqrt(sum_{i=1}^{N} b_i^2))
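
A minimal sketch of this formula, assuming non-zero vectors:

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine of the angle between vectors a and b (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1, 0], [0, 1]))  # orthogonal vectors -> 1.0
print(cosine_distance([1, 1], [2, 2]))  # parallel vectors   -> ~0.0
```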

9
Q

What are normalised ranks?

A

For ordinal values:
1. Sort the values and assign each a rank
2. Map each rank to evenly spaced values between 0 and 1, e.g. multiples of 1/4 for 4 values
3. Compare using an ordinary distance function (see the sketch below)
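
A minimal sketch of this mapping; the ordinal scale used here is an illustrative assumption:

```python
# Ordered ordinal scale (assumed for illustration): worst -> best
scale = ["poor", "fair", "good", "excellent"]

# Step 2: map rank r (1-based) to evenly spaced values 1/4, 2/4, 3/4, 4/4
normalised = {v: (i + 1) / len(scale) for i, v in enumerate(scale)}
print(normalised)  # {'poor': 0.25, 'fair': 0.5, 'good': 0.75, 'excellent': 1.0}

# Step 3: compare with an ordinary distance function
print(abs(normalised["fair"] - normalised["excellent"]))  # -> 0.5
```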

10
Q

How can the label be chosen?

A

Majority voting

Inverse distance weighting: w_i = 1/(d_i + e), with a small epsilon e so the weight stays finite when the test instance coincides with a training instance (d_i = 0)

Inverse linear distance: w_i = (d_max - d_i)/(d_max - d_min)
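
A minimal sketch of inverse distance weighting, assuming the (distance, label) pairs of the K nearest neighbours are already computed:

```python
from collections import defaultdict

def weighted_vote(nearest, epsilon=1e-9):
    """Pick the label with the largest total inverse-distance weight."""
    scores = defaultdict(float)
    for d, label in nearest:
        scores[label] += 1 / (d + epsilon)  # epsilon guards against d == 0
    return max(scores, key=scores.get)

nearest = [(0.5, "a"), (1.0, "b"), (2.0, "b")]
print(weighted_vote(nearest))  # a: 2.0 vs b: 1.5 -> "a"
```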

11
Q

What is the impact of k in KNN?

A

A lower K produces a jagged decision boundary; a higher K produces a smoother decision boundary
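
A small sketch of how the choice of K changes results, using scikit-learn on a toy dataset (the library and dataset are assumptions, not part of the card):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small K fits noise (jagged boundary); very large K over-smooths
for k in (1, 5, 25, 101):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, clf.score(X_test, y_test))
```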

12
Q

What if more than K neighbours share the same distance?

A

Pick randomly among the tied neighbours

Change the distance metric

13
Q

How can ties be broken if two classes are equally likely?

A

Avoid even values of K

Break the tie randomly

Use the class with the highest prior probability (see the sketch below)
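
A minimal sketch of the prior-probability tie-break, assuming the priors are estimated as class frequencies in the training labels:

```python
from collections import Counter

def vote_with_prior_tiebreak(neighbour_labels, train_labels):
    """Majority vote; ties go to the class with the highest prior."""
    votes = Counter(neighbour_labels)
    top = max(votes.values())
    tied = [label for label, v in votes.items() if v == top]
    priors = Counter(train_labels)  # prior ~ class frequency in training data
    return max(tied, key=lambda label: priors[label])

train_labels = ["a"] * 60 + ["b"] * 40
print(vote_with_prior_tiebreak(["a", "b", "a", "b"], train_labels))  # -> "a"
```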

14
Q

What are the pros of KNN?

A

It is intuitive
Supports both classification and regression
Makes no assumptions about the data distribution
No training phase

15
Q

What are the cons of KNN?

A

Difficult to choose the best distance function
Difficult to choose the right K
Expensive with large data sets

16
Q

What is lazy learning?

A

Lazy learning is also known as instance-based learning.

The training data is stored, and test instances are compared directly to the training data. There is no explicit training phase.

17
Q

What is eager learning?

A

A model is trained on labelled training instances. The model generalises from seen data to unseen data, and then predicts the labels of test instances.