Lecture 7B: K Nearest Neighbor Flashcards
K-Nearest Neighbors Algorithm
A super simple way to classify data
How does it work in a nutshell?
Step 1: Start with a dataset with known categories, already clustered into different groups
Step 2: Add a new cell with an unknown category
Step 3: Classify the new cell by looking at the nearest annotated cells (i.e. the “nearest neighbours”)
If the “K” in “K-nearest neighbour” is equal to 1, then we only use the single nearest neighbour to define the category.
If K = 11, we would use the 11 nearest neighbors, etc.
If K=11 and the new cell is between two (or more) categories…
we simply pick the category that “gets the most votes”
If the new cell is right between two categories…
1) If K is odd, then we can avoid a lot of ties
2) If we still get a tied vote, we can flip a coin or decide not to assign the cell a category
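A minimal sketch of this voting idea (the toy 2-D points are made up for illustration; ties return None, matching option 2 above):

```python
# Minimal KNN voting sketch (hypothetical toy 2-D points).
from collections import Counter
import math

train = [((1.0, 1.2), "A"), ((0.8, 1.0), "A"), ((3.0, 3.1), "B"),
         ((3.2, 2.9), "B"), ((2.9, 3.3), "B")]

def knn_predict(point, train, k=3):
    # Sort training points by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda t: math.dist(point, t[0]))
    votes = Counter(label for _, label in nearest[:k])
    top = votes.most_common()
    # If the top two categories tie, decline to assign a category.
    if len(top) > 1 and top[0][1] == top[1][1]:
        return None
    return top[0][0]

print(knn_predict((2.5, 2.5), train, k=3))  # -> "B"
```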
Few thoughts on picking a value for “K”
There is no physical or biological way to determine the best value for “K”, so you may have to try out a few values before settling on one. Do this by pretending part of the training data is “unknown” and checking how well it gets classified.
Low values for K (like K = 1 or K = 2) can be noisy and subject to the effects of outliers.
Large values for K smooth over things, but you don’t want K to be so large that a category with only a few samples in it will always be outvoted by the other categories
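One common way to “pretend part of the training data is unknown” is cross-validation; a sketch below, assuming scikit-learn is available (the iris dataset is just a stand-in, not from the lecture):

```python
# Sketch: pick K by cross-validating over a few candidate values.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"K={k}: mean accuracy = {scores.mean():.3f}")
```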
How do we calculate the “distance” or “similarity” between the to-be-predicted point and its “neighbours”?
The distance metric is itself a hyper-parameter of KNN. Common choices:
- Euclidean Distance
- Minkowski Distance
- Manhattan Distance
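A quick sketch of how these relate: Minkowski distance with p=2 reduces to Euclidean and with p=1 to Manhattan (the two points reuse the (76,82)/(74,84) example from the brute-force section below):

```python
# Minkowski distance; p=2 gives Euclidean, p=1 gives Manhattan.
def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (76, 82), (74, 84)
print("Euclidean:", minkowski(a, b, p=2))     # sqrt(2**2 + 2**2) ~ 2.83
print("Manhattan:", minkowski(a, b, p=1))     # |2| + |2| = 4
print("Minkowski (p=3):", minkowski(a, b, p=3))
```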
Brute Force Algorithm
- Calculate the distance between the test data and each row of training data.
  For example, we calculate the distance of (76,82) from each of the training data points using the Euclidean method (or one of the other distance metrics).
- Sort the distance values calculated in the previous step in ascending order.
- Use the top k rows from the sorted list.
  From the sorted distance list, our 5 nearest training points are (74,84), (76,86), etc.
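A sketch of these three steps in code; only (74,84) and (76,86) come from the example above, the other training rows are made up for illustration:

```python
# Brute-force KNN lookup for the (76, 82) example.
import math

train = [(74, 84), (76, 86), (70, 70), (90, 95), (60, 80), (85, 75)]
query = (76, 82)

# Step 1: distance from the query to every training row.
dists = [(math.dist(query, p), p) for p in train]
# Step 2: sort the distances in ascending order.
dists.sort()
# Step 3: keep the top k rows.
k = 5
print(dists[:k])
```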
Hyper Parameters of KNN
- Brute Force: Computes ALL pairwise Euclidean distances and takes the vote/mean of the k nearest neighbors.
  For large sample sizes and large dimensionality this takes a VERY LONG TIME.
- K-D Tree: A “tree based” approach that reduces the computational inefficiency of the Brute Force approach. In other words, if point A is very distant from point B, and point B is very close to point C, then we know that points A and C are very distant, without having to explicitly calculate their distance.
  For large sample sizes and number of dimensions < 20, the K-D Tree is much faster than Brute Force, but it becomes inefficient as the number of dimensions grows.
- Ball Tree: Designed to overcome the inefficiency of the K-D Tree in higher dimensions.
  Nodes of the tree are a series of nested hyper-spheres.
  Can outperform the K-D Tree on high-dimensional data, but actual performance can vary highly based on the structure of the training data.
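If you use scikit-learn, these three strategies correspond to its algorithm hyper-parameter; a sketch with random data, purely for illustration:

```python
# Sketch: switching between brute force, K-D tree and ball tree in scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

for algo in ["brute", "kd_tree", "ball_tree"]:
    clf = KNeighborsClassifier(n_neighbors=11, algorithm=algo).fit(X, y)
    print(algo, clf.score(X, y))
```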
Special Note on Leaf Size
- As discussed, for small sample sizes a brute force search can be more efficient than a tree-based query.
- This fact is accounted for in the ball tree and KD tree by internally switching to brute force searches within leaf nodes.
- The level of this switch can be specified with the parameter leaf_size.
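A sketch of passing leaf_size when building a ball tree, assuming scikit-learn’s BallTree (random data for illustration):

```python
# Sketch: leaf_size controls when a tree query falls back to brute force inside a leaf.
import numpy as np
from sklearn.neighbors import BallTree

X = np.random.default_rng(0).normal(size=(500, 5))
tree = BallTree(X, leaf_size=40)      # larger leaves -> more brute force inside leaves
dist, idx = tree.query(X[:3], k=5)    # 5 nearest neighbours of the first 3 points
print(idx)
```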
How to Choose: Based on Sample Size(N) and Dimensionality(D)
Smaller Dataset & Smaller Dimensionality (N < 30, D < 20): Brute Force
For larger datasets (N > 30) but smaller dimensionality (D < 20): K-D Tree
For larger datasets and larger dimensionality: Ball Tree (up to a certain point)
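These rules of thumb could be codified in a small hypothetical helper; the thresholds (30 samples, 20 dimensions) are the lecture’s heuristics, not hard limits:

```python
# Hypothetical helper encoding the rough rules of thumb above.
def choose_algorithm(n_samples: int, n_features: int) -> str:
    if n_samples < 30 and n_features < 20:
        return "brute"
    if n_features < 20:
        return "kd_tree"
    return "ball_tree"

print(choose_algorithm(10_000, 50))  # -> "ball_tree"
```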
How to Choose: Based on Sparsity of Data Structure
- Sparsity of the data structure refers to the degree to which the data fills up the parameter space; it is a different notion from sparse matrices
- Sparsity does not affect the Brute Force query time
- Ball Tree performs better than KD Tree with a sparser data structure
How to Choose: Number of neighbors - K
- Brute force query time is largely unaffected by the value of k
- Ball tree and KD tree query times become slower as k increases
How to Choose: Model Construction / Training Time
- Brute Force is fast
- K-D Tree takes more time and resource
- Ball Tree takes even more time & resource than K-D Tree
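A rough timing sketch to see this for yourself, assuming scikit-learn (absolute times depend heavily on hardware and data shape):

```python
# Compare construction (fit) time and query time for the three strategies.
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(20_000, 15))
for algo in ["brute", "kd_tree", "ball_tree"]:
    t0 = time.perf_counter()
    nn = NearestNeighbors(n_neighbors=11, algorithm=algo).fit(X)
    t1 = time.perf_counter()
    nn.kneighbors(X[:500])
    t2 = time.perf_counter()
    print(f"{algo}: fit {t1 - t0:.3f}s, query {t2 - t1:.3f}s")
```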
kNN Applications
Works well for questions like: “find items similar to this one”
Works well with what datasets?
Those with a non-linear separation of the target variable
Example applications:
- Recommender System
- Finding documents containing similar topics
- Feature Extraction in Computer Vision problems, such as, face recognition
- Fingerprint matching
- Detecting unusual patterns in credit card usage
- In Data Analytics, can be used for:
  - Imputing missing values (see the sketch below)
  - Minority oversampling of imbalanced data
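For the missing-value use case, a sketch assuming scikit-learn’s KNNImputer (the tiny matrix is made up for illustration):

```python
# Sketch: fill missing entries with the mean of the K nearest complete rows.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```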