5-KNN Flashcards
What is the KNN algorithm?
- Store all training data points
- Compute distance of test instance to all training data points
- Find K closest data points
4 . Compute concept based on nearest data points
How are data points represented in KNN?
Each point is a feature vector
What is hamming distance?
Number of differing elements in two strings of different elements. E.g. differences in nominal attributes once one-hot encoded
What is simple matching distance?
1 - number of matching features / total features
What is jaccard distance?
1 - | A ^ B | / |A U B|
If nominal features are one-hot encoded, sets are expressed as the elements encoded with 1
What is manhatten distance?
d(a,b) = sum from i to N of |a_i - b_i|
What is euclidean distance?
d(a,b) = sqrt(sum from i to N of (a_i - b_i)^2)
What is cosine distance?
d(a,b) = 1 - cos(a,b)
= 1 - sum(a_ib_i)/(sqrt(sum(a_i^2))sqrt(sum(b_i^2)))
What is normalised ranks?
For ordinal values:
1. Sort values and return rank
2. Map rank to evenly spaced values between 0 and 1, i.e. 1/4 if 4
3. Compare using distance function
How can label be chosen?
Majority voting
Inverse distance weighting (usually with small epsilon in case instances match): 1/di + e
Inverse linear distance: (d_max - d_i)/(d_max - d_min)
What is the impact of k in KNN?
Lower k generates jagged decision boundary. Higher k generates smooth decision boundary
How if more than K neighbours share same distance?
Random
Change distance matrix
How to break ties if two classes are equally as likely?
Avoid even K
Random tie break
Use class with highest prior probability
What are the pros of KNN?
It is intuitive
Supports classification and regression
No assumptions
No training phase
What are the cons of KNN?
Difficult to choose best distance function
Difficult to choose right K
Expensive with large data sets