KNN Flashcards
Effect of larger k on decision boundary
- Smoother decision boundary
- Reduces the risk of overfitting caused by noisy data
- Increases the risk of ignoring important patterns
Effect of smaller k on decision boundary
- More discriminating; can capture finer patterns in the data
- Makes the model more susceptible to noise and / or outliers, so there is a higher risk of overfitting
The [larger / smaller] the dataset, the less important the difference between two choices of k becomes.
Larger
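
As a quick illustration of the two cards above, a minimal sketch (assuming scikit-learn and a synthetic two-moons dataset; numbers are illustrative): very small k tends to fit the noise, very large k smooths the boundary and can start to underfit.

```python
# Sketch: effect of k on KNN fit (scikit-learn assumed; dataset is synthetic).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 25, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Very small k: high train accuracy, lower test accuracy (overfitting).
    # Very large k: boundary smooths out and may begin to underfit.
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))
```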
Strengths of KNN
- Simple / easy to interpret
- Makes no assumptions about the underlying data distribution
- Training phase is very fast
Weaknesses of KNN
- Does not produce a model
- Selection of appropriate k is often arbitrary
- Slow classification phase
- Does not handle missing values, outliers, or nominal data well without pre-processing
KNN is a [parametric / non-parametric] algorithm
Non-parametric - makes no assumptions about the underlying data distribution; as a result, the model structure is determined directly from the data
Is KNN a lazy learning algorithm?
Yes - it does not use the training data points to make any generalization during training
- There is little to no explicit training phase
- The training phase is very fast
- KNN keeps all the training data, since it is needed during the classification (testing) phase
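
A rough illustration of the lazy-learning point (scikit-learn assumed; timings are machine-dependent): fitting mostly just stores and lightly indexes the training data, while prediction does the distance work.

```python
# Sketch: KNN "training" mostly stores the data, so fit is fast and
# predict is comparatively slow (scikit-learn assumed; timings will vary).
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 20))
y = rng.integers(0, 2, size=50_000)

knn = KNeighborsClassifier(n_neighbors=5)

t0 = time.perf_counter(); knn.fit(X, y); t1 = time.perf_counter()
knn.predict(X[:1_000]); t2 = time.perf_counter()

print(f"fit:     {t1 - t0:.3f}s")   # fast: data is only stored / indexed
print(f"predict: {t2 - t1:.3f}s")   # slower: distances computed at query time
```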
KNN is a [supervised / non-supervised] algorithm
Supervised - relies on input data that is labelled to learn a function that produces an appropriate output
Common distance definitions used in KNN
Euclidean distance and Manhattan distance. Minkowski distance is a generalization of both Euclidean and Manhattan distances.
When to set Manhattan distance over Euclidean distance as the distance metric?
Manhattan distance is often preferred over Euclidean distance when the data is high-dimensional
Def Euclidean distance
Measure of the true straight-line distance between two points in Euclidean space
D_e(x, y) = [∑: i=1 to n ( x_i - y_i ) ^ 2] ^ ( 1 / 2 )
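
A minimal sketch of this formula in NumPy (the function name euclidean_distance is illustrative):

```python
# Sketch: Euclidean distance between two feature vectors (NumPy assumed).
import numpy as np

def euclidean_distance(x, y):
    x, y = np.asarray(x), np.asarray(y)
    return np.sqrt(np.sum((x - y) ** 2))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```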
Def Manhattan distance
Measure of the sum of the absolute differences of the Cartesian coordinates of two points
D_m(x, y) = ∑: i=1 to n | x_i - y_i |
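
The same kind of sketch for this formula (the function name manhattan_distance is illustrative):

```python
# Sketch: Manhattan (city-block) distance between two feature vectors (NumPy assumed).
import numpy as np

def manhattan_distance(x, y):
    x, y = np.asarray(x), np.asarray(y)
    return np.sum(np.abs(x - y))

print(manhattan_distance([0, 0], [3, 4]))  # 7
```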
Pseudocode of KNN algorithm to predict the class
- Load the classified data (training set)
- Initialise the value of k
- Load the new (unclassified) points
- For each new_point_i:
  - Calculate the distance between new_point_i and each training data point
  - Sort the calculated distances in ascending order and select the top k rows
  - Get the most frequent class among these k rows via majority vote
  - Assign that class as the predicted class for new_point_i
- Return the array of predicted classes
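
A minimal NumPy sketch of the steps above; the helper name knn_predict and the brute-force Euclidean search are illustrative choices, not from any particular library.

```python
# Sketch of the pseudocode above: brute-force KNN classification with NumPy.
# Names (knn_predict, new_points, ...) are illustrative only.
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, new_points, k=3):
    train_X, train_y = np.asarray(train_X), np.asarray(train_y)
    predictions = []
    for new_point in np.asarray(new_points):
        # Distance from the new point to every training point (Euclidean).
        distances = np.sqrt(np.sum((train_X - new_point) ** 2, axis=1))
        # Indices of the k nearest training points.
        nearest = np.argsort(distances)[:k]
        # Majority vote over the classes of the k nearest neighbours.
        predicted = Counter(train_y[nearest]).most_common(1)[0][0]
        predictions.append(predicted)
    return np.array(predictions)

# Tiny usage example with two classes.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, [[0.5, 0.5], [5.5, 5.5]], k=3))  # [0 1]
```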
Def Minkowski distance
Generalization of Manhattan (p = 1) and Euclidean (p = 2) distance
D_mink(x, y) = [∑: i=1 to n | x_i - y_i | ^ p] ^ ( 1 / p )
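
A minimal sketch showing how this formula recovers Manhattan (p = 1) and Euclidean (p = 2) distance (NumPy assumed; the function name is illustrative):

```python
# Sketch: Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean.
import numpy as np

def minkowski_distance(x, y, p=2):
    x, y = np.asarray(x), np.asarray(y)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

print(minkowski_distance([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
print(minkowski_distance([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)
```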
In KNN, a single observation can be thought of as a:
Vector in high-dimensional feature space.
A feature vector can be denoted as
x = (x_1, x_2, ..., x_D) such that x_d denotes the value of the dth feature of x
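
A tiny illustrative example (the feature names are made up for illustration) of one observation as a feature vector:

```python
# Sketch: one observation as a vector in feature space
# (feature names below are invented for illustration).
import numpy as np

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
x = np.array([5.1, 3.5, 1.4, 0.2])  # x_d is the value of the d-th feature
print(dict(zip(features, x)))
```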