Module 3: 2. Classification and Regression Models: K-Nearest Neighbors Flashcards
Explain KNN using an example.
Make the drawing in the rough notes.
Given xq ——> find yq ∈ {0,1}
- Find the k nearest points to xq in D
- Let k = 3
- {x1, x2, x3} ——> 3-NN to xq, with labels {y1, y2, y3} ——> majority vote over {y1, y2, y3} gives yq
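A minimal brute-force sketch of this (the toy points, labels, and function name are assumptions for illustration, not from the notes):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_q, k=3):
    # Euclidean distances from the query point to every training point
    dists = np.linalg.norm(X_train - x_q, axis=1)
    # indices of the k nearest neighbours
    nn_idx = np.argsort(dists)[:k]
    # majority vote over the neighbours' labels
    return Counter(y_train[nn_idx]).most_common(1)[0][0]

# hypothetical toy data
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.1, 1.0]), k=3))  # -> 0
```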
What are the failure cases in KNN?
- xq is far away from all points in D
- xq lies in a jumble of +ve/-ve points (the neighbourhood carries no useful information)
How to calculate Euclidean distance?
x1 = (x11, x12)
x2 = (x21, x22)
d = length of the shortest line joining x1 and x2
d = √((x21 - x11)^2 + (x22 - x12)^2) = ||x1 - x2||_2
Euclidean distance:
||x1 - x2||_2 = (Σ_i (x1i - x2i)^2)^(1/2)
||x1||_2 = Euclidean distance of x1 from the origin = (Σ_i x1i^2)^(1/2)
Also,
||x1 - x2||_2 ——> L2-norm of the vector (x1 - x2)
How to calculate Manhattan distance?
Manhattan distance ——> Σ_i |x1i - x2i|
||x1 - x2||_1 ——> L1-norm of the vector (x1 - x2)
||x1||_1 = Σ_i |x1i|
How to calculate Minkowski distance?
Minkowski distance:
||x1 - x2||_p = (Σ_i |x1i - x2i|^p)^(1/p), where p > 0 (it is a proper metric only for p ≥ 1)
p = 2 ——> Minkowski distance ——> Euclidean distance
p = 1 ——> Minkowski distance ——> Manhattan distance
- A distance is always between 2 points, while a norm is always of a vector
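A small sketch checking that p = 1 and p = 2 recover Manhattan and Euclidean distance (the example vectors are assumed):

```python
import numpy as np

def minkowski(x1, x2, p):
    # (sum of |x1i - x2i|^p)^(1/p)
    return np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p)

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 0.0, 3.0])

print(minkowski(x1, x2, p=1))          # Manhattan = 5.0
print(minkowski(x1, x2, p=2))          # Euclidean ≈ 3.606
print(np.linalg.norm(x1 - x2, ord=1))  # same value: L1-norm of (x1 - x2)
print(np.linalg.norm(x1 - x2, ord=2))  # same value: L2-norm of (x1 - x2)
```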
How to calculate Hamming distance?
x1 = [ 0, 1, 1, 0, 1, 0, 0 ]
x2 = [ 1, 0, 1, 0, 1, 0, 1 ]
Hamming-distance(x1, x2) = # of locations/dimensions where the binary vectors differ = 3
Strings:
x1 = a, b, c, a, d, e, f, g, h, i, k
x2 = a, c, b, a, d, e, g, f, h, i, k
Hamming-distance(x1,x2) = 4
Gene Code Sequence:
x1 = A A G T C T C A G
x2 = A G A T C T C G A
Hamming-distance(x1,x2) = 4
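A short sketch reproducing the three Hamming-distance examples above (the helper name `hamming` is my own):

```python
def hamming(a, b):
    # number of positions where the two equal-length sequences differ
    assert len(a) == len(b)
    return sum(ai != bi for ai, bi in zip(a, b))

print(hamming([0, 1, 1, 0, 1, 0, 0], [1, 0, 1, 0, 1, 0, 1]))  # 3
print(hamming("abcadefghik", "acbadegfhik"))                   # 4
print(hamming("AAGTCTCAG", "AGATCTCGA"))                       # 4
```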
Explain Cosine distance and Cosine similarity.
Intuitively,
similarity ↓ ———> distance ↑
similarity ↑ ———> distance ↓
By definition,
cos-dist(x1, x2) = 1 - cos-sim(x1, x2)
cos-sim(x1, x2) = cos θ, where θ is the angle between x1 and x2
Draw in rough notes (in the drawing, d denotes Euclidean distance and x3 lies in the same direction as x1 but farther away)
cos-sim(x1, x2) = cos θ
cos-dist(x1, x2) = 1 - cos θ
cos-sim(x1, x3) = 1 (since the angle between x1 and x3 is 0 degrees)
cos-dist(x1, x3) = 1 - 1 = 0
So Euclidean distance d13 > d12, but cosine distance cos-dist13 < cos-dist12
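A small sketch of the point above, with an assumed x3 chosen in the same direction as x1 but farther away:

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x1 = np.array([1.0, 1.0])
x2 = np.array([1.0, 0.0])
x3 = 3 * x1   # same direction as x1, but farther from it

print(np.linalg.norm(x1 - x3) > np.linalg.norm(x1 - x2))  # True: d13 > d12
print(1 - cos_sim(x1, x3) < 1 - cos_sim(x1, x2))          # True: cos-dist13 < cos-dist12
```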
If x1 and x2 are unit vectors and θ is the angle between x1 and x2, then what is the relationship between euclidean distance and cosine similarity of x1 and x2?
We know,
cos θ = (x1 · x2) / (||x1|| ||x2||)
If x1 and x2 are unit vectors,
||x1|| = ||x2|| = 1
Therefore,
cos θ = x1 · x2
||x1 - x2||^2 = ||x1||^2 + ||x2||^2 - 2(x1 · x2) = 2(1 - cos θ)
So (Euclidean distance(x1, x2))^2 = 2(1 - cos θ) = 2 · cos-dist(x1, x2)
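A quick numerical check of this relationship on assumed random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=3), rng.normal(size=3)
x1, x2 = x1 / np.linalg.norm(x1), x2 / np.linalg.norm(x2)  # make them unit vectors

cos_theta = np.dot(x1, x2)
# squared Euclidean distance equals 2 * (1 - cos θ) for unit vectors
print(np.isclose(np.linalg.norm(x1 - x2) ** 2, 2 * (1 - cos_theta)))  # True
```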
What is the time and space complexity of KNN?
Time complexity ——> O(nd) per query
Space complexity ——> O(nd)
Both are A LOT for large n and d.
What are the limitations of KNN?
- Space complexity
Eg. n ≈ 364k and d ≈ 100k ——> ~36,400M values ≈ 36 GB ——> needs a machine with ~64 GB RAM
- Time complexity
~36 billion computations per query ——> a few seconds
Therefore KNN can't be used in low-latency applications.
Draw the decision surfaces for KNN with different values and also talk about the impact of different types of values.
Draw in rough notes
1. K = 1
- Non-smooth
- Overfitting
2. K = 5
- Smooth
3. K = n
- Underfitting
- Every query gets predicted as the majority class
What is the need for cross validation?
k = 1, 2, 3, …, n ——> k = 1 overfits, k = n underfits, so the optimal value of k lies somewhere in the middle
One idea: try different values of k, train on Dtrain, and check the accuracy on Dtest.
Draw accuracy vs k graph
From the graph, suppose k = 6 gives the best accuracy on Dtest when Dtrain is used as training data.
Small problem: the sole objective of ML is to perform well on UNSEEN DATA ——> generalization.
Since k was tuned using Dtest, Dtest is no longer unseen, so we can't claim the 96% accuracy measured on it as the accuracy on unseen data.
Cross Validation :
Dn ——> Dtrain (60%) ——> used to compute the NN
   ——> Dcv   (20%) ——> used to determine the right 'k' value
   ——> Dtest (20%) ——> used once, to estimate accuracy on unseen data
Now, if Dtest gives 93% accuracy with the optimal 'k' found on Dcv, we can say the model's accuracy on unseen data is about 93%.
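A sketch of this train / CV / test protocol with scikit-learn (the synthetic dataset, split sizes, and candidate k values are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=42)

# 60% train, 20% CV, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# pick k using accuracy on the CV set only
best_k = max(range(1, 30, 2),
             key=lambda k: KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_cv, y_cv))

# report generalization accuracy once, on the untouched test set
test_acc = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train).score(X_test, y_test)
print(best_k, test_acc)
```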
What is K-fold cross validation?
We saw in Cross Validation :
Dn ——–> Dtrain ——> (60%)
——–> Dcv ——> (20%)
——–> Dtest ——> (20%)
Problem: only 60% of the data is used to compute the NN ——> Is there a way to use 80% of the data for the NN? More training data generally gives a better model.
K-fold CV:
1. Dn ——-> Dtrain (80%) ——–> for finding K and computing NN
——–> Dtest (20%) ——> unseen data
2. Split Dtrain randomly into 4 equal parts [ D1, D2, D3, D4 ] ——> each is 20% of the full data
3. Here K' = 4, i.e. 4-fold CV
| K (in K-NN) | Train    | CV | Accuracy on CV |
|-------------|----------|----|----------------|
| K = 1       | D1 D2 D3 | D4 | a4             |
| K = 1       | D1 D2 D4 | D3 | a3             |
| K = 1       | D1 D3 D4 | D2 | a2             |
| K = 1       | D2 D3 D4 | D1 | a1             |
| K = 2       | D1 D2 D3 | D4 | a4'            |
| K = 2       | D1 D2 D4 | D3 | a3'            |
| K = 2       | D1 D3 D4 | D2 | a2'            |
| K = 2       | D2 D3 D4 | D1 | a1'            |
| …           | …        | …  | …              |
avg(a1, a2, a3, a4) = a_{K=1} ——> average CV accuracy of K-NN for K = 1 over the 4 folds
Suppose a_{K=3} is the best average accuracy over the 4-fold CV ——> best K = 3 (3-NN) ——> train the NN on all of Dtrain with K = 3 ——> then evaluate once on Dtest ——> e.g. 93% accuracy on unseen data
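A sketch of K'-fold CV for choosing K with scikit-learn's `cross_val_score` (the dataset, K' = 4, and the grid of K values are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

k_values = range(1, 30, 2)
# average CV accuracy over the 4 folds for each candidate K
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=4).mean()
             for k in k_values]
best_k = list(k_values)[int(np.argmax(cv_scores))]

# final, single evaluation on the held-out test set
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final_model.score(X_test, y_test))
```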
What is the right K’?
Rule of thumb —> K’ = 10 ——> 10-fold cv
The time it takes to compute the optimal/best k for K-NN increases by a factor of K' when we use K'-fold CV.
Given a value of ‘k’ how to determine whether it underfits or overfits?
Graphical Method:
Accuracy = (# correctly classified pts)/(Total # pts)
error = 1 - accuracy
Draw graph in rough notes
What is time based splitting?
—> works only if we have timestamp
Sort Dn in ascending order of time
In datasets like amazon food reviews, a time based splitting is better than random splitting because with time products and their reviews change.
TBS:
| T-90 → T-30 | T-30 → T-15 | T-15 → T |
|-------------|-------------|----------|
| Train (compute NN) | CV (find k) | Test |
- Whenever time is available and the data/behaviour changes over time, time-based splitting is preferable to random splitting.
Can we use K-NN for regression. If yes, how?
YES
1. Given xq find k-nearest neighbours —> (x1,y1),(x2,y2),…,(xk,yk)
2. The neighbours' targets y1, y2, …, yk ∈ R ——> compute yq from them:
yq = mean(yi)
OR
yq = median(yi)
—> simple extension/modification of classification
—> median is less prone to outliers
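A minimal sketch of K-NN regression on assumed toy 1-D data, showing why the median is less sensitive to an outlier neighbour:

```python
import numpy as np

def knn_regress(X_train, y_train, x_q, k=3, agg=np.mean):
    # distances from the query to every training point
    dists = np.linalg.norm(X_train - x_q, axis=1)
    # aggregate the targets of the k nearest neighbours
    nn_idx = np.argsort(dists)[:k]
    return agg(y_train[nn_idx])

X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([1.1, 1.9, 3.2, 50.0])            # the last target is an outlier

print(knn_regress(X, y, np.array([7.0]), k=3, agg=np.mean))    # ~18.4, pulled towards the outlier
print(knn_regress(X, y, np.array([7.0]), k=3, agg=np.median))  # 3.2, more robust
```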
Explain weighted K-NN.
Eg. 5-NN (k = 5), where 2 neighbours are -ve and 3 are +ve:
| point  | di (distance to xq) | label | wi = 1/di |
|--------|---------------------|-------|-----------|
| x1, y1 | 0.1                 | -ve   | 10        |
| x2, y2 | 0.2                 | -ve   | 5         |
| x3, y3 | 1.0                 | +ve   | 1         |
| x4, y4 | 2.0                 | +ve   | 0.5       |
| x5, y5 | 4.0                 | +ve   | 0.25      |
wi = 1/di ——> simplest weight function
As di ↑, wi ↓ ; as di ↓, wi ↑
Plain majority vote gives +ve (3 vs 2), but the weighted vote gives -ve (total weight 15 vs 1.75).
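A sketch of the weighted vote for exactly this 5-NN example:

```python
import numpy as np

dists  = np.array([0.1, 0.2, 1.0, 2.0, 4.0])
labels = np.array([-1, -1, +1, +1, +1])
weights = 1.0 / dists                       # 10, 5, 1, 0.5, 0.25

neg_score = weights[labels == -1].sum()     # 15.0
pos_score = weights[labels == +1].sum()     # 1.75
print(-1 if neg_score > pos_score else +1)  # weighted vote -> -1 (plain majority said +1)
```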
Explain Voronoi Diagram.
A Voronoi diagram is a partition of a plane into regions, each consisting of the points closest to one object from a given set of objects (for 1-NN, the objects are the training points and the regions are its decision cells).
Draw in rough notes
Explain Binary Search Tree. Is it better than K-NN? If yes, why?
K-NN ——> time complexity ——> O(n), if d and k are small
——> space complexity ——> O(n)
Is there a way to reduce the time complexity from O(n) to O(log n)?
eg. n = 1024 ——> log2(n) = 10
Binary Search Tree (BST):
Problem: given a sorted array, find whether a given number is present in the array or not.
1. Construct a BST
2. Find the number in array based on conditions of the BST
BST search time complexity ——> O(log n) ——> the depth of a (balanced) tree
How to build a KD-Tree (short for k-dimensional tree), which is essentially a BST generalized to k dimensions:
1. Pick x-axis
2. Project points onto x-axis
3. Compute median
4. Split data using median
5. Alternate between the axes (x, y, x, y, …) and repeat the same steps on each half (see the sketch below)
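Rather than hand-rolling the tree, a sketch using SciPy's KD-Tree implementation (the random data and query point are assumptions):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))      # n = 1000 points in d = 3 dimensions

tree = KDTree(X)                    # build the tree over the data
x_q = np.array([0.0, 0.0, 0.0])
dists, idx = tree.query(x_q, k=3)   # 3 nearest neighbours of the query point
print(idx, dists)
```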
What are the limitations of KD-Tree?
Limitations:
1. When d is not small (small meaning roughly 2, 3, 4, 5, the range where KD-Trees work well):
——> d = 10 ——> 2^d = 1024 adjacent cells to inspect
——> d = 20 ——> 2^d ≈ 1 million adjacent cells to inspect
When d is 10, 11, 12, … the time complexity increases dramatically:
it is no longer O(log n) but becomes O(2^d · log n).
If d is large (eg. 2^d ≈ n), it becomes O(n log n) ——> worse than brute force.
2. Even when d is small, O(log n) holds only if the data is roughly uniformly distributed; otherwise the query time moves towards O(n).
Explain Hashing vs Locality Sensitive Hashing (LSH) with example.
Eg. array a = [ 2, 1, 5, 6, 7, 5, 8] —-> unordered
Q. Find if 5 exists in a.
Hashing:
x ——> h(x), a hash function ——> maps x to a bucket in a hash table, so we can look x up without scanning the array
If we had done sequential/linear search then the time complexity for that would be O(n)
In case of hashing the time complexity —–> O(1)
Locality Sensitive Hashing(LSH):
—-> Type of hashing used to find the points in the same neighborhood as xq given xq
eg. xq —–> h(xq) ——> x1,x2,x4,x7,…. (returns points in the same neighborhood)
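A tiny sketch contrasting a linear scan with a hash-based lookup on the array above:

```python
a = [2, 1, 5, 6, 7, 5, 8]

print(5 in a)        # linear search over the list: O(n)
print(5 in set(a))   # hash-based lookup after building a set: average O(1) per query
```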
Explain LSH for cosine similarity.
Draw diagram in rough notes
cos-sim(x1,x2) = cosθ
cos-sim(x1,x3) = cos0 = 1
As θ ↑, x1 and x2 become more angularly separated ——> cos θ ↓ ——> similarity ↓ ——> cosine distance ↑
cosine similarity ≈ angular similarity
Draw diagram in rough notes
θ12 ~ 0 —-> x1 and x2 are similar/close.
h(x1)=h(x2)=key —–> both are in same neighborhood
LSH —-> Randomized algo
—-> Will not always give you correct answers
—-> Will try to give correct answer with high probability
Steps:
1. Draw a random hyperplane (π1)
w1Tx = 0 ——> w1 : normal to the plane
w1 is a d-dim random vector (indexed from 0 to d-1)
w1 = numpy.random.normal(0, 1, d)
** If w1Tx1 >= 0 and w1Tx2 < 0 ——> x1 lies on the positive side of π1 and x2 on the negative side
- Lets say we draw two more such planes w2 and w3
Draw diagram in rough notes
m = # of hyperplanes
Points falling in the same slice get the same hash value, determined by which side of each hyperplane they lie on, eg. h(x7) = (+1, -1, -1)
Here +1 = sign(w1Tx7), -1 = sign(w2Tx7), -1 = sign(w3Tx7)
To calculate all this,
Time complexity ~ O(mdn)
Space complexity ~ O(nd)
Given a query point xq,
1. Calculate h(xq) i.e. the hash value as shown earlier using signs
2. Pass h(xq) as the key in hash table
—–> d.get(h(xq)) ——-> returns [x1,x2,x4,x7]
Time complexity for querying ~ O(md+n’d) –>n’=#elements in bucket
If n’ is small or n’≈m then —-> O(md)
Typically m≈log(n) ——-> m = # of hyperplanes
Therefore it becomes O(d*logn)
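A sketch of random-hyperplane LSH following these steps (n, d, m, and the data are assumptions):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
n, d, m = 1000, 20, 10                   # m ≈ log2(n) random hyperplanes
X = rng.normal(size=(n, d))

W = rng.normal(size=(m, d))              # each row is the normal of one hyperplane

def hash_key(x):
    # the key is the tuple of signs of w_j^T x, one per hyperplane
    return tuple(np.sign(W @ x).astype(int))

# build the hash table: bucket -> list of point indices
table = defaultdict(list)
for i, x in enumerate(X):
    table[hash_key(x)].append(i)

# query: only look at points that fall into the same bucket as xq
x_q = rng.normal(size=d)
candidates = table.get(hash_key(x_q), [])
print(len(candidates), "candidate neighbours for xq")
```

A brute-force distance computation over just the candidates (instead of all n points) then picks the k nearest among them.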
What is the problem with LSH with cos-sim? How to overcome it?
- LSH is randomized, so it can miss a true nearest neighbour: two points that are close in cosine similarity may still fall on opposite sides of one of the random hyperplanes and hash to different buckets.
- To reduce this risk, repeat the construction with several independent sets of m random hyperplanes (multiple hash tables) and take the union of the candidates retrieved for xq; this raises the probability of finding the true NN at the cost of extra time and memory.