L17 - KNN and Weighted-KNN Flashcards
What type of model is KNN?
A supervised, non-parametric learning model used for classification.
What core assumption does the model work on?
Data points within close proximity are likely to be of the same class.
What is the majority vote concept of KNN?
The new data point is classified based on the majority class of the surrounding data points.
Which distance metric is used to determine similarity?
The Euclidean distance metric.
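The two cards above can be sketched together: a minimal KNN classifier using Euclidean distance and a majority vote over the K nearest neighbours. Function names (`euclidean`, `knn_classify`) are illustrative, not from the cards.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line (Euclidean) distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, new_point, k=3):
    # train: list of (features, label) pairs.
    # Sort training points by distance to the new point; keep the k nearest.
    neighbours = sorted(train, key=lambda p: euclidean(p[0], new_point))[:k]
    # Majority vote over the k nearest labels.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify(train, (1.5, 1.5), k=3))  # -> A
```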
What is K? What type of parameter is it?
A hyper-parameter of the algorithm, representing the number of surrounding points assessed in the majority vote.
What are the 2 main issues that can skew the classification performance?
Outliers - If a class has outliers close to a cluster of a different class, this can cause incorrect classification of new data.
Class imbalance - If one class count heavily outweighs another, this can cause incorrect classification.
What is the solution to Outliers and Class Imbalance?
Weighting data points by the inverse of their distance to the new data point.
Why is the selection of K so important?
Determines the number of neighbours to assess.
It must be chosen carefully: too small a K makes the model sensitive to noise (overfitting), while too large a K smooths over class boundaries (under-fitting).
Why should K be an odd number?
To avoid classification ties.
If there are a high number of outliers, should a high or low K be chosen? Give reason…
High K.
To compensate for the outliers by including a wider spread of data points in the vote.
This ensures nearby class clusters are assessed, which will outweigh the outlier points.
What are the 2 main methods for choosing K? Explain each…
Incremental: Start with K = 1 and increment by 1. At each increment, run a classification test on held-out test data to measure performance with that K.
Square Root: K = the square root of the number of training data points.
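A minimal sketch of the square-root method, combined with the odd-K rule from the earlier card (the function name `square_root_k` is illustrative):

```python
import math

def square_root_k(n_samples):
    # K = square root of the number of training points,
    # bumped to the nearest odd integer to avoid classification ties.
    k = max(1, round(math.sqrt(n_samples)))
    return k if k % 2 == 1 else k + 1

print(square_root_k(100))  # -> 11 (sqrt(100) = 10, even, so bump to 11)
```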
What is the purpose of applying Weighted KNN? How does it work?
Compensates for outliers and class imbalance by assigning a higher weight to nearer points.
All K points are assessed. Points of the same class have their weight summed. New data point is assigned to the class with the greatest weight.
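The steps above can be sketched as follows, assuming inverse-distance weights of the form 1/d (the epsilon guard and function name `weighted_knn_classify` are illustrative additions):

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_knn_classify(train, new_point, k=3):
    # Take the k nearest neighbours of the new point.
    neighbours = sorted(
        ((euclidean(feat, new_point), label) for feat, label in train)
    )[:k]
    weights = defaultdict(float)
    for dist, label in neighbours:
        # Weight each neighbour by the inverse of its distance; the small
        # epsilon guards against division by zero when the new point
        # coincides with a training point.
        weights[label] += 1.0 / (dist + 1e-9)
    # Assign the class with the greatest summed weight.
    return max(weights, key=weights.get)

# The single nearby "A" point outweighs the two distant "B" points,
# whereas a plain majority vote with k=3 would have chosen "B".
train = [((0, 0), "A"), ((10, 10), "B"), ((10, 11), "B")]
print(weighted_knn_classify(train, (1, 1), k=3))  # -> A
```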
What are some things to consider when using KNN or Weighted-KNN?
Training Data Size - Performance degrades on large data sets, since each prediction requires computing the distance to every training point. Complexity grows with training-set size.
Normalisation - All features should be normalised to the range 0 to 1, so that features with larger scales do not dominate the distance calculation.
Dimensionality - Both work better in lower dimensions, so the feature space should be reduced where possible, e.g. through feature selection.
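The normalisation point can be sketched with min-max scaling, which maps a feature to the 0-1 range mentioned above (the function name `min_max_normalise` is illustrative):

```python
def min_max_normalise(values):
    # Rescale one feature's values to [0, 1] so that no single feature
    # dominates the distance calculation purely because of its scale.
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it all to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalise([10, 20, 30, 40]))  # -> [0.0, 0.333..., 0.666..., 1.0]
```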
What are some advantages and disadvantages of KNN and WKNN?
Advantages:
- Simple to implement
- Adaptable
- Few hyper-parameters
Disadvantages:
- Computationally expensive on large data
- Doesn’t perform well on high-dimensional data
- Prone to overfitting
What are the 2 hyper-parameters of KNN?
K
Distance Metric