Lecture 7B: K Nearest Neighbor Flashcards

1
Q

K-Nearest Neighbors Algorithm

A

A super simple way to classify data

2
Q

How does it work in a nutshell?

A

Step 1: Start with a dataset of known categories, already grouped into clusters.
Step 2: Add a new cell whose category is unknown.
Step 3: Classify the new cell by looking at the nearest annotated cells (i.e. the “nearest neighbours”).

If “K” in “K-nearest neighbours” is equal to 1, then we only use the single nearest neighbour to define the category.

If K = 11, we would use the 11 nearest neighbours, and so on.
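A minimal sketch of these steps using scikit-learn (assumed to be available); the tiny two-feature dataset, the labels, and K = 3 are made up purely for illustration:

    from sklearn.neighbors import KNeighborsClassifier

    # Step 1: dataset with known categories (two measurements per cell)
    X_train = [[1.0, 1.2], [0.8, 1.0], [5.0, 5.5], [5.2, 4.8]]
    y_train = ["type A", "type A", "type B", "type B"]

    # Step 2: a new cell with unknown category
    new_cell = [[1.1, 0.9]]

    # Step 3: classify it from its K nearest annotated neighbours
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print(knn.predict(new_cell))  # -> ['type A']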

3
Q

If K=11 and the new cell is between two (or more) categories…

A

we simply pick the category that “gets the most votes”

4
Q

If the new cell is right between two categories…

A

1) If K is odd, then we can avoid a lot of ties

2) If we still get a tied vote, we can flip a coin or decide not to assign the cell a category
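A minimal sketch of the voting step, assuming the labels of the K nearest neighbours have already been collected; the function name and example labels are illustrative only:

    import random
    from collections import Counter

    def vote(nearest_labels):
        counts = Counter(nearest_labels).most_common()
        winners = [label for label, n in counts if n == counts[0][1]]
        if len(winners) == 1:
            return winners[0]          # clear majority
        return random.choice(winners)  # tie: "flip a coin" (or return None instead)

    print(vote(["A", "A", "B"]))  # -> 'A'
    print(vote(["A", "B"]))       # -> 'A' or 'B' at random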

5
Q

A few thoughts on picking a value for “K”

A

There is no physical or biological way to determine the best value for “K”, so you may have to try out a few values before settling on one. Do this by pretending part of the training data is “unknown” (i.e. cross-validation).

Low values for K (like K = 1 or K = 2) can be noisy and subject to the effects of outliers.

Large values for K smooth things over, but you don’t want K to be so large that a category with only a few samples in it will always be voted out by the other categories.
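A minimal sketch of trying out a few K values by holding out parts of the training data (5-fold cross-validation here); the synthetic dataset and the candidate K values are assumptions:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    for k in (1, 3, 5, 11, 21):
        acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
        print(f"K={k:2d}  mean accuracy={acc:.3f}")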

6
Q

How do we calculate the “distance” or “similarity” between the to-be-predicted point and its “neighbours”?

A

The “Distance” hyper-parameter (the metric) is usually one of the following (a short sketch follows the list):

  • Euclidean Distance
  • Minkowski Distance
  • Manhattan Distance
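A minimal sketch of the three metrics for two points p and q; the example points are made up and the helper name is hypothetical. Minkowski distance with power 2 reduces to Euclidean, and with power 1 to Manhattan:

    def minkowski(p, q, power):
        return sum(abs(a - b) ** power for a, b in zip(p, q)) ** (1.0 / power)

    p, q = (76, 82), (74, 84)
    print(minkowski(p, q, 2))  # Euclidean  ~2.83
    print(minkowski(p, q, 1))  # Manhattan   4.0
    print(minkowski(p, q, 3))  # Minkowski with power 3, ~2.52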
7
Q

Brute Force Algorithm

A
  • Calculate the distance between the test data and each row of training data.
    For example, we calculate the distance of (76,82) from each of the training data points using the Euclidean method (or any of the other distance metrics).
  • Sort the distance values calculated in the previous step in ascending order.
  • Use the top k rows from the sorted list (a sketch of these steps follows below).
    From the sorted distance list, our 5 nearest training points are: (74,84), (76,86), etc.
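A minimal sketch of these brute-force steps; the test point (76,82) and the two nearest training points echo this card, while the remaining points and the pass/fail labels are made up:

    import math

    train = [((74, 84), "pass"), ((76, 86), "pass"), ((75, 80), "pass"),
             ((60, 50), "fail"), ((58, 55), "fail"), ((62, 58), "fail")]
    test, k = (76, 82), 5

    # 1) distance between the test point and each row of training data
    dists = [(math.dist(test, point), label) for point, label in train]
    # 2) sort the distances in ascending order
    dists.sort(key=lambda d: d[0])
    # 3) take the top k rows and let them vote
    labels = [label for _, label in dists[:k]]
    print(max(set(labels), key=labels.count))  # -> 'pass'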
8
Q

Hyper Parameters of KNN

A
  • Brute Force: Computes ALL pairwise Euclidean distances and then votes/averages over the k nearest neighbours (see the sketch below).
    For large sample sizes and large dimensionality this takes a VERY LONG TIME.
  • K-D Tree: A “tree-based” approach that reduces the computational inefficiency of the Brute Force approach. In other words, if point A is very distant from point B, and point B is very close to point C, then we know that points A and C are very distant, without having to explicitly calculate their distance.
    For large sample sizes and fewer than 20 dimensions, the K-D Tree is much faster than Brute Force, but it becomes inefficient as the number of dimensions grows.
  • Ball Tree: Designed to overcome the inefficiency of the K-D Tree for higher dimensions.
    The nodes of the tree are a series of nested hyper-spheres.
    Can outperform the K-D Tree for high-dimensional data, but actual performance varies greatly with the structure of the training data.
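A minimal sketch of switching between these approaches via scikit-learn's algorithm hyper-parameter; the random data and its shape are arbitrary:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X = np.random.rand(1000, 10)            # N = 1000 samples, D = 10 dimensions
    y = np.random.randint(0, 2, size=1000)

    for algo in ("brute", "kd_tree", "ball_tree"):
        knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo).fit(X, y)
        print(algo, knn.predict(X[:3]))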
9
Q

Special Note on Leaf Size

A
  • As discussed, for small sample sizes a brute force search can be more efficient than a tree-based query.
  • This fact is accounted for in the ball tree and KD tree by internally switching to brute force searches within leaf nodes.
  • The level of this switch can be specified with the parameter leaf_size (see the sketch below).
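A minimal sketch of setting this switch point in scikit-learn; leaf_size=30 is the library default, written out here only to make the parameter explicit:

    from sklearn.neighbors import KNeighborsClassifier

    knn = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree", leaf_size=30)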
10
Q

How to Choose: Based on Sample Size (N) and Dimensionality (D)

A

Smaller Dataset & Smaller Dimensionality (N < 30, D < 20): Brute Force
Larger Dataset (N > 30) but Smaller Dimensionality (D < 20): K-D Tree
Larger Dataset and Larger Dimensionality: Ball Tree (up to a certain point)
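A hypothetical helper that simply encodes these rules of thumb; the thresholds come from this card and the function name is made up:

    def choose_algorithm(n_samples, n_dims):
        if n_samples < 30 and n_dims < 20:
            return "brute"
        if n_dims < 20:
            return "kd_tree"
        return "ball_tree"  # up to a certain point

    print(choose_algorithm(20, 5))    # -> 'brute'
    print(choose_algorithm(500, 10))  # -> 'kd_tree'
    print(choose_algorithm(500, 50))  # -> 'ball_tree'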

11
Q

How to Choose: Based on Sparsity of Data Structure

A
  • Sparsity of the data structure refers to the degree to which the data fills up the parameter space; it is different from sparse matrices
  • Sparsity does not affect the Brute Force query time
  • Ball Tree performs better than KD Tree with sparser data structures
12
Q

How to Choose: Number of neighbors - K

A
  • Brute force query time is largely unaffected by the value of k
  • Ball tree and KD tree query time will become slower as k increases
13
Q

How to Choose: Model Construction / Training Time

A
  • Brute Force is fast
  • K-D Tree takes more time and resources
  • Ball Tree takes even more time & resources than K-D Tree
14
Q

kNN Applications

A

Works well with the question: “find items similar to this one”

15
Q

Works well with what datasets…

A

Those that have a non-linear separation of the target variable

16
Q

Example applications:

A
  • Recommender Systems
  • Finding documents containing similar topics
  • Feature Extraction in Computer Vision problems, such as face recognition
  • Fingerprint matching
  • Detecting unusual patterns in credit card usage
  • In Data Analytics, kNN can be used for (see the sketch below):
    Imputing missing values
    Minority oversampling of imbalanced data
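A minimal sketch of one of the data-analytics uses, imputing missing values with scikit-learn's KNNImputer; the toy matrix is made up:

    import numpy as np
    from sklearn.impute import KNNImputer

    X = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0], [8.0, 8.0]])
    print(KNNImputer(n_neighbors=2).fit_transform(X))  # nan filled from its 2 nearest rows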