Lecture 7B: K Nearest Neighbor Flashcards
K-Nearest Neighbors Algorithm
A super simple way to classify data
How does it work in a nutshell?
Step 1: Start with a dataset with known categories, already clustered into different groups
Step 2: Add a new cell with an unknown category
Step 3: Classify the new cell by looking at the nearest annotated cells (i.e. the “nearest neighbours”)
If the “K” in “K-nearest neighbour” is equal to 1, then we only use the single nearest neighbour to define the category.
If K = 11, we would use the 11 nearest neighbors, etc.
If K=11 and the new cell is between two (or more) categories…
we simply pick the category that “gets the most votes”
If the new cell is right between two categories…
1) If K is odd, then we can avoid a lot of ties
2) If we still get a tied vote, we can flip a coin or decide not to assign the cell a category
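A minimal sketch of this voting idea (the toy 2-D points are made up for illustration; ties return None, matching option 2 above):

```python
# Minimal KNN voting sketch (hypothetical toy 2-D points).
from collections import Counter
import math

train = [((1.0, 1.2), "A"), ((0.8, 1.0), "A"), ((3.0, 3.1), "B"),
         ((3.2, 2.9), "B"), ((2.9, 3.3), "B")]

def knn_predict(point, train, k=3):
    # Sort training points by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda t: math.dist(point, t[0]))
    votes = Counter(label for _, label in nearest[:k])
    top = votes.most_common()
    # If the top two categories tie, decline to assign a category.
    if len(top) > 1 and top[0][1] == top[1][1]:
        return None
    return top[0][0]

print(knn_predict((2.5, 2.5), train, k=3))  # -> "B"
```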
Few thoughts on picking a value for “K”
There is no physical or biological way to determine the best value for “K”, so you may have to try out a few values before settling on one. Do this by pretending part of the training data is “unknown” and checking how well it gets classified.
Low values for K (like K = 1 or K = 2) can be noisy and subject to the effects of outliers.
Large values for K smooth over things, but you don’t want K to be so large that a category with only a few samples in it will always be outvoted by the other categories
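One common way to “pretend part of the training data is unknown” is cross-validation; a sketch below, assuming scikit-learn is available (the iris dataset is just a stand-in, not from the lecture):

```python
# Sketch: pick K by cross-validating over a few candidate values.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"K={k}: mean accuracy = {scores.mean():.3f}")
```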
How do we calculate the “distance” or “similarity” between the to-be-predicted point and its “neighbours”?
The distance metric is itself a hyper-parameter of KNN. Common choices:
- Euclidean Distance
- Minkowski Distance
- Manhattan Distance
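A quick sketch of how these relate: Minkowski distance with p=2 reduces to Euclidean and with p=1 to Manhattan (the two points reuse the (76,82)/(74,84) example from the brute-force section below):

```python
# Minkowski distance; p=2 gives Euclidean, p=1 gives Manhattan.
def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (76, 82), (74, 84)
print("Euclidean:", minkowski(a, b, p=2))     # sqrt(2**2 + 2**2) ~ 2.83
print("Manhattan:", minkowski(a, b, p=1))     # |2| + |2| = 4
print("Minkowski (p=3):", minkowski(a, b, p=3))
```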
Brute Force Algorithm
- Calculate the distance between the test data and each row of training data.
  For example, we calculate the distance of (76,82) from each of the training data points using the Euclidean method (or one of the other distance metrics).
- Sort the distance values calculated in the previous step in ascending order.
- Use the top k rows from the sorted list.
  From the sorted distance list, our 5 nearest training points are (74,84), (76,86), etc.
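A sketch of these three steps in code; only (74,84) and (76,86) come from the example above, the other training rows are made up for illustration:

```python
# Brute-force KNN lookup for the (76, 82) example.
import math

train = [(74, 84), (76, 86), (70, 70), (90, 95), (60, 80), (85, 75)]
query = (76, 82)

# Step 1: distance from the query to every training row.
dists = [(math.dist(query, p), p) for p in train]
# Step 2: sort the distances in ascending order.
dists.sort()
# Step 3: keep the top k rows.
k = 5
print(dists[:k])
```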
Hyper Parameters of KNN
- Brute Force: Computes ALL pairwise Euclidean distances and takes the vote/mean of the k nearest neighbors.
  For large sample sizes and large dimensionality this takes a VERY LONG TIME.
- K-D Tree: A “tree based” approach that reduces the computational inefficiency of the Brute Force approach. In other words, if point A is very distant from point B, and point B is very close to point C, then we know that points A and C are very distant, without having to explicitly calculate their distance.
  For large sample sizes and number of dimensions < 20, the K-D Tree is much faster than Brute Force, but it becomes inefficient as the number of dimensions grows.
- Ball Tree: Designed to overcome the inefficiency of the K-D Tree in higher dimensions.
  Nodes of the tree are a series of nested hyper-spheres.
  Can outperform the K-D Tree on high-dimensional data, but actual performance can vary highly based on the structure of the training data.
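If you use scikit-learn, these three strategies correspond to its algorithm hyper-parameter; a sketch with random data, purely for illustration:

```python
# Sketch: switching between brute force, K-D tree and ball tree in scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

for algo in ["brute", "kd_tree", "ball_tree"]:
    clf = KNeighborsClassifier(n_neighbors=11, algorithm=algo).fit(X, y)
    print(algo, clf.score(X, y))
```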
Special Note on Leaf Size
- As discussed, for small sample sizes a brute force search can be more efficient than a tree-based query.
- This fact is accounted for in the ball tree and KD tree by internally switching to brute force searches within leaf nodes.
- The level of this switch can be specified with the parameter leaf_size.
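A sketch of passing leaf_size when building a ball tree, assuming scikit-learn’s BallTree (random data for illustration):

```python
# Sketch: leaf_size controls when a tree query falls back to brute force inside a leaf.
import numpy as np
from sklearn.neighbors import BallTree

X = np.random.default_rng(0).normal(size=(500, 5))
tree = BallTree(X, leaf_size=40)      # larger leaves -> more brute force inside leaves
dist, idx = tree.query(X[:3], k=5)    # 5 nearest neighbours of the first 3 points
print(idx)
```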
How to Choose: Based on Sample Size(N) and Dimensionality(D)
Smaller Dataset & Smaller Dimensionality (N < 30, D < 20): Brute Force
For larger datasets (N > 30) but smaller dimensionality (D < 20): K-D Tree
For larger datasets and larger dimensionality: Ball Tree (up to a certain point)
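These rules of thumb could be codified in a small hypothetical helper; the thresholds (30 samples, 20 dimensions) are the lecture’s heuristics, not hard limits:

```python
# Hypothetical helper encoding the rough rules of thumb above.
def choose_algorithm(n_samples: int, n_features: int) -> str:
    if n_samples < 30 and n_features < 20:
        return "brute"
    if n_features < 20:
        return "kd_tree"
    return "ball_tree"

print(choose_algorithm(10_000, 50))  # -> "ball_tree"
```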
How to Choose: Based on Sparsity of Data Structure
- Sparsity of the data structure refers to the degree to which the data fills up the parameter space; it is a different notion from sparse matrices
- Sparsity does not affect the Brute Force query time
- Ball Tree performs better than KD Tree with a sparser data structure
How to Choose: Number of neighbors - K
- Brute force query time is largely unaffected by the value of k
- Ball tree and KD tree query times become slower as k increases
How to Choose: Model Construction / Training Time
- Brute Force is fast
- K-D Tree takes more time and resource
- Ball Tree takes even more time & resource than K-D Tree
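A rough timing sketch to see this for yourself, assuming scikit-learn (absolute times depend heavily on hardware and data shape):

```python
# Compare construction (fit) time and query time for the three strategies.
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(20_000, 15))
for algo in ["brute", "kd_tree", "ball_tree"]:
    t0 = time.perf_counter()
    nn = NearestNeighbors(n_neighbors=11, algorithm=algo).fit(X)
    t1 = time.perf_counter()
    nn.kneighbors(X[:500])
    t2 = time.perf_counter()
    print(f"{algo}: fit {t1 - t0:.3f}s, query {t2 - t1:.3f}s")
```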
kNN Applications
Works well for questions like: “find items similar to this one”
Works well with what datasets?
Those with a non-linear separation of the target variable
Example applications:
- Recommender System
- Finding documents containing similar topics
- Feature Extraction in Computer Vision problems, such as, face recognition
- Fingerprint matching
- Detecting unusual patterns in credit card usage
- In Data Analytics, can be used for:
  - Imputing missing values (see the sketch below)
  - Minority oversampling of imbalanced data
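For the missing-value use case, a sketch assuming scikit-learn’s KNNImputer (the tiny matrix is made up for illustration):

```python
# Sketch: fill missing entries with the mean of the K nearest complete rows.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```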