Data Mining - Chapter 7 (k-Nearest Neighbors) Flashcards
What is k-Nearest Neighbor?
It is an algorithm that can be used for classification of a categorical outcome or prediction of a numerical outcome.
-> It works by finding the most similar records in the training data and assigning the new record to their group.
What is voting in k-nearest neighbor?
Given a specific number k of neighbors, you look at what their categories are. Based on the majority of their outcomes, you set the category of the new record. For example, with k = 5, if three neighbors belong to class A and two to class B, the new record is classified as A.
What kind of method is the k-nearest neighbor?
It is a nonparametric method. That means it does not make any assumptions about the form of the relationship between the dependent variable and the predictor variables.
-> It looks at similarities between the predictor values of the records in the dataset.
What is the process of the k-Nearest Neighbor?
- Determine the item's neighbors
- Choose the number of neighbors to include - choose k
- Compute the classification
How do you determine a record's neighbors?
It is based on the similarity / closeness between records, which is usually measured as the distance between records.
How do you note the distance between records?
Record i = ri
Record j = rj
Distance between the two records: dij
-> So you compute the distance between two records (rows); each record can have multiple predictor variables.
What are the properties of distances?
P1: Non-negativity: the distance is non-negative: dij ≥ 0
P2: Self-proximity: dii = 0
P3: Symmetry: dij = dji
P4: Triangle inequality: dij <= dik + dkj
(distance between any pair cannot exceed the sum of distances between the other two pairs)
What is the Euclidean distance and how does it work?
Most popular distance measure for numerical values.
dij = √((Xi1 - Xj1)^2 + (Xi2 - Xj2)^2 + … + (Xip - Xjp)^2)
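As an illustration, a minimal numpy sketch (xi and xj are made-up example records with p = 3 predictors):

import numpy as np

xi = np.array([3.0, 4.0, 1.0])  # record i
xj = np.array([1.0, 1.0, 2.0])  # record j
d_euclidean = np.sqrt(((xi - xj) ** 2).sum())  # sqrt(4 + 9 + 1) ≈ 3.74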
What are some remarks on the Euclidean distance?
- Highly scale dependent: the units of one variable can have a huge influence on the results. -> Therefore it is sometimes necessary to standardize the values first (see the sketch below).
- Sensitive to outliers, because the differences are squared.
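A minimal z-score standardization sketch with numpy (X is a made-up feature matrix; note how the second column's larger units would otherwise dominate the distance):

import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # each column: mean 0, std 1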
What is the Manhattan distance?
A distance measure that is more robust to outliers than the Euclidean distance: it sums the absolute differences instead of the squared differences.
dij = |Xi1 - Xj1| + |Xi2 - Xj2| +…+ |Xip - Xjp|
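Using the same made-up records as in the Euclidean sketch above:

import numpy as np

xi = np.array([3.0, 4.0, 1.0])
xj = np.array([1.0, 1.0, 2.0])
d_manhattan = np.abs(xi - xj).sum()  # 2 + 3 + 1 = 6.0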
What happens if K is too low?
We may be fitting to the noise (random variation) in the dataset. This can lead to overfitting.
What happens if K is too high?
You miss out on capturing the local structure of the data -> you average over too many records, which reduces the meaning of the classifications.
What happens if K = n ?
Every new record is assigned to the majority class of the entire dataset -> oversmoothing.
How do you decide on the value of k in a Python model?
- Build a k-nearest neighbors model for each value of k you want to try.
- For every model, compute the accuracy score: accuracy_score(test_y, model.predict(test_x)).
- Pick the k with the highest accuracy (see the sketch below).
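A minimal sketch with scikit-learn, assuming the data has already been split (train_X, train_y, test_X, test_y are placeholder names):

from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Note: predictors should typically be standardized first, because the
# distance measure is scale dependent.
# Try k = 1..14 and report each model's accuracy on the holdout set.
for k in range(1, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(train_X, train_y)
    acc = accuracy_score(test_y, model.predict(test_X))
    print(f"k={k}: accuracy={acc:.3f}")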
What changes if you use K nearest neighbors for numerical outcomes?
Everything stays the same, except that you do not determine the class through majority voting; instead, you take the average outcome value of the k-nearest neighbors as the prediction.
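A minimal sketch, assuming scikit-learn and the same placeholder data names as in the classification sketch above:

from sklearn.neighbors import KNeighborsRegressor

# With the default uniform weights, the prediction is the average outcome
# value of the k nearest neighbors.
model = KNeighborsRegressor(n_neighbors=5).fit(train_X, train_y)
predictions = model.predict(test_X)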