Module 3: 2. Classification And Regression Models: K-Nearest Neighbors Flashcards
Explain KNN using an example.
Make the drawing in the rough notes.
Given xq —–> Find yq ∈ {0,1}
- Find the K nearest points to xq in D
- Let k = 3
- {x1, x2, x3} ——> 3-NN to xq, with labels {y1, y2, y3}
- yq ——> majority vote over {y1, y2, y3}
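A minimal sketch of this 3-NN majority-vote idea in NumPy (the toy dataset X, labels y and query xq below are made-up for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, xq, k=3):
    """Predict yq by majority vote over the k nearest neighbours of xq."""
    dists = np.linalg.norm(X - xq, axis=1)             # Euclidean distance from xq to every point in D
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    return Counter(y[nearest]).most_common(1)[0][0]    # majority vote over {y1, ..., yk}

# Toy example: 6 points in 2-D with binary labels
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
xq = np.array([1.1, 0.9])
print(knn_predict(X, y, xq, k=3))   # -> 0 (all 3 nearest neighbours are labelled 0)
```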
What are the failure cases in KNN?
- xq is far away from all the points in D (its nearest neighbours carry little information)
- The neighbourhood of xq is a jumble of +ve/-ve points (no useful information)
How to calculate Euclidean distance?
x1 = (x11, x12)
x2 = (x21, x22)
d = length of the straight line joining x1 and x2
d = √((x21 - x11)^2 + (x22 - x12)^2) = ||x1 - x2||_2
Euclidean distance (d-dimensional):
||x1 - x2||_2 = (Σ_i (x1i - x2i)^2)^(1/2)
||x1||_2 = Euclidean distance of x1 from the origin
= (Σ_i x1i^2)^(1/2)
Also,
||x1 - x2||_2 ——> L2-norm of the vector (x1 - x2)
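A quick NumPy check of these L2 formulas (x1 and x2 are arbitrary example vectors):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 6.0, 3.0])

# Euclidean distance written out from the definition
d = np.sqrt(np.sum((x1 - x2) ** 2))

# Same value as the L2-norm of the difference vector
assert np.isclose(d, np.linalg.norm(x1 - x2, ord=2))

# ||x1||_2 = Euclidean distance of x1 from the origin
print(np.linalg.norm(x1, ord=2))   # sqrt(1 + 4 + 9) ≈ 3.742
```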
How to calculate Manhattan distance?
Manhattan distance ——> Σ_i |x1i - x2i|
||x1 - x2||_1 ——> L1-norm of the vector (x1 - x2)
||x1||_1 = Σ_i |x1i|
How to calculate Minkowski distance?
Minkowski distance:
||x1 - x2||_p = (Σ_i |x1i - x2i|^p)^(1/p) ——> defined for p > 0
p = 2 ——> Minkowski distance —–> Euclidean distance
p = 1 ——> Minkowski distance —–> Manhattan distance
- A distance is measured between 2 points, whereas a norm is always defined for a single vector
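A small sketch of the Minkowski distance and its p = 1 / p = 2 special cases (the vectors are made-up examples):

```python
import numpy as np

def minkowski(x1, x2, p):
    """Minkowski distance ||x1 - x2||_p, defined for p > 0."""
    return np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p)

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 0.0, 3.0])

# p = 1 reduces to the Manhattan (L1) distance
assert np.isclose(minkowski(x1, x2, 1), np.linalg.norm(x1 - x2, ord=1))
# p = 2 reduces to the Euclidean (L2) distance
assert np.isclose(minkowski(x1, x2, 2), np.linalg.norm(x1 - x2, ord=2))
```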
How to calculate Hamming distance?
x1 = [ 0, 1, 1, 0, 1, 0, 0 ]
x2 = [ 1, 0, 1, 0, 1, 0, 1 ]   (they differ at positions 1, 2 and 7)
Hamming-distance(x1, x2) = # of locations/dimensions where the binary vectors differ = 3
Strings:
x1 = a, b, c, a, d, e, f, g, h, i, k
x2 = a, c, b, a, d, e, g, f, h, i, k   (they differ at positions 2, 3, 7 and 8)
Hamming-distance(x1,x2) = 4
Gene Code Sequence:
x1 = A A G T C T C A G
x2 = A G A T C T C G A   (they differ at positions 2, 3, 8 and 9)
Hamming-distance(x1,x2) = 4
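A minimal sketch of Hamming distance that works for both binary vectors and strings (using the examples above):

```python
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    assert len(a) == len(b)
    return sum(ai != bi for ai, bi in zip(a, b))

print(hamming([0, 1, 1, 0, 1, 0, 0], [1, 0, 1, 0, 1, 0, 1]))   # binary vectors -> 3
print(hamming("AAGTCTCAG", "AGATCTCGA"))                        # gene sequences -> 4
```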
Explain Cosine distance and Cosine similarity.
Intuitively,
similarity ↓ ———> distance ↑
similarity ↑ ———> distance ↓
By definition,
1 - cos-sim(x1, x2) = cos-dist(x1, x2)
cos-sim(x1, x2) = cos θ, where θ is the angle between x1 and x2
cos-dist(x1, x2) = 1 - cos θ
Draw in rough notes (d = Euclidean distance in the drawing)
cos-sim(x1, x3) = 1 (since the angle between x1 and x3 is 0 degrees, x3 lying along the same direction as x1)
cos-dist(x1, x3) = 1 - 1 = 0
Euclidean distance d13 > d12, but cosine-distance cos-dist13 < cos-dist12
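A small NumPy sketch of cosine similarity / distance illustrating this picture (the vectors are made-up; x3 points in the same direction as x1 but is farther away):

```python
import numpy as np

def cos_sim(a, b):
    """cos(theta) between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cos_dist(a, b):
    return 1.0 - cos_sim(a, b)

x1 = np.array([1.0, 1.0])
x2 = np.array([0.0, 1.0])
x3 = np.array([4.0, 4.0])   # same direction as x1, much larger Euclidean distance

print(np.linalg.norm(x1 - x3) > np.linalg.norm(x1 - x2))   # True: d13 > d12
print(cos_dist(x1, x3) < cos_dist(x1, x2))                 # True: cos-dist13 < cos-dist12
```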
If x1 and x2 are unit vectors and θ is the angle between x1 and x2, then what is the relationship between euclidean distance and cosine similarity of x1 and x2?
We know,
cos θ = (x1.x2)/||x1||||x2||
If x1 and x2 are unit vectors,
||x1|| = ||x2|| = 1
Therefore,
cos θ = x1.x2
[Euclidean distance(x1, x2)]^2 = ||x1 - x2||^2
= ||x1||^2 + ||x2||^2 - 2 x1.x2
= 2 - 2 cos θ
= 2 * (1 - cos θ)
= 2 * cos-dist(x1, x2)
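A quick numeric check of this identity on two random unit vectors (a sketch for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=5); x1 /= np.linalg.norm(x1)   # unit vector
x2 = rng.normal(size=5); x2 /= np.linalg.norm(x2)   # unit vector

cos_theta = np.dot(x1, x2)                  # since ||x1|| = ||x2|| = 1
euclid_sq = np.linalg.norm(x1 - x2) ** 2    # squared Euclidean distance

assert np.isclose(euclid_sq, 2 * (1 - cos_theta))
```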
What is the time and space complexity of KNN?
Time complexity (per query point) ——> O(nd)
Space complexity (to store all of D) ——> O(nd)
A LOT for large n and d
What are the limitations of KNN?
- Space complexity:
Eg. n ≈ 364k and d ≈ 100k ——> 36,400M values ≈ 36 GB ——> needs a ~64 GB RAM machine just to hold D
- Time complexity:
≈ 36 billion computations per query ——> a few seconds per query
Therefore KNN can't be used in low-latency applications
Draw the decision surfaces for KNN with different values of K and discuss the impact of those values.
Draw in rough notes
1. K = 1
- Non-smooth
- Overfitting
2. K = 5
- Smooth
3. K = n
- Underfitting
- Everything becomes majority class
What is the need for cross validation?
k = 1, 2, 3, ……, n ——> the optimal value of k will be somewhere in the middle
(k = 1 ——> overfit, k = n ——> underfit)
One idea : try different values of k, train on Dtrain for each, and check the accuracy on Dtest
Draw accuracy vs k graph
So, we can conclude k = 6 gives the best accuracy on Dtest when using Dtrain as training data.
Small Problem : The sole objective of ML is to perform well on UNSEEN DATA ——> Generalization
Since k was tuned using Dtest, Dtest is no longer truly unseen, so we can't claim the model will be 96% accurate on unseen data.
Cross Validation :
Dn ——–> Dtrain ——> NN (60%)
——–> Dcv ——> Used to determine the right ‘k’ value (20%)
——–> Dtest ——> apply f to get accuracy on unseen data (20%)
Now, if Dtest gives 93% accuracy on optimal value of ‘k’ found on Dcv, we can say accuracy of model is 93% on unseen data.
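A hedged sketch of this train / CV / test recipe using scikit-learn (the dataset and exact split ratios below are placeholders for illustration):

```python
from sklearn.datasets import load_breast_cancer          # stand-in dataset
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Dn -> Dtrain (60%), Dcv (20%), Dtest (20%)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Use Dcv to determine the right k
best_k, best_acc = None, -1.0
for k in range(1, 30, 2):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_cv, y_cv)
    if acc > best_acc:
        best_k, best_acc = k, acc

# Accuracy on the untouched Dtest is what we report as accuracy on unseen data
test_acc = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train).score(X_test, y_test)
print(best_k, test_acc)
```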
What is K-fold cross validation?
We saw in Cross Validation :
Dn ——–> Dtrain ——> (60%)
——–> Dcv ——> (20%)
——–> Dtest ——> (20%)
Problem : Only 60% of the data is used to compute NN ——> is there a way to use 80% of the data for NN, since more training data generally gives a better model?
K-fold CV:
1. Dn ——-> Dtrain (80%) ——–> for finding K and computing NN
——–> Dtest (20%) ——> unseen data
2. [ D1, D2, D3, D4 ] —–> Randomly 4 equal parts —–> Each consists of 20% of the data
3. Here K’ = 4 ——> 4-fold CV
| k (in k-NN) | Train    | CV | Accuracy on CV |
|-------------|----------|----|----------------|
| k = 1       | D1 D2 D3 | D4 | a4  |
| k = 1       | D1 D2 D4 | D3 | a3  |
| k = 1       | D1 D3 D4 | D2 | a2  |
| k = 1       | D2 D3 D4 | D1 | a1  |
| k = 2       | D1 D2 D3 | D4 | a4’ |
| k = 2       | D1 D2 D4 | D3 | a3’ |
| k = 2       | D1 D3 D4 | D2 | a2’ |
| k = 2       | D2 D3 D4 | D1 | a1’ |
| …           | …        | …  | …   |
avg(a1, a2, a3, a4) = a_k=1 ——> CV accuracy of K-NN for k = 1
Suppose a_k=3 is the best average accuracy over the 4-fold CV ——> best k = 3 ——> 3-NN ——> train the NN on all of Dtrain with k = 3 ——> then evaluate on Dtest ——> e.g. 93% accuracy on unseen data
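A hedged sketch of the same K’-fold idea with scikit-learn (K’ = 4 folds over Dtrain; the dataset and split are placeholders):

```python
from sklearn.datasets import load_breast_cancer          # stand-in dataset
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# For each k, average the accuracy over the 4 CV folds (avg(a1, a2, a3, a4) in the table above)
best_k, best_cv_acc = None, -1.0
for k in range(1, 30, 2):
    cv_acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=4).mean()
    if cv_acc > best_cv_acc:
        best_k, best_cv_acc = k, cv_acc

# Retrain on all of Dtrain with the best k, then check once on Dtest (unseen data)
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final.score(X_test, y_test))
```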
What is the right K’?
Rule of thumb —> K’ = 10 ——> 10-fold cv
The time it takes to find the optimal/best k in K-NN increases by a factor of K’ when we use K’-fold CV
Given a value of ‘k’, how do we determine whether it underfits or overfits?
Graphical Method:
Accuracy = (# correctly classified pts)/(Total # pts)
error = 1 - accuracy
Draw graph in rough notes (plot train error and CV error against k: low train error but high CV error ——> overfit; both errors high ——> underfit)
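A hedged sketch of this graphical method (the dataset and split are placeholders; the idea is to plot error = 1 - accuracy against k for both the train and CV sets):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer          # stand-in dataset
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)

ks = list(range(1, 30, 2))
train_err, cv_err = [], []
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_err.append(1 - model.score(X_train, y_train))   # error = 1 - accuracy
    cv_err.append(1 - model.score(X_cv, y_cv))

plt.plot(ks, train_err, label="train error")
plt.plot(ks, cv_err, label="CV error")
plt.xlabel("k"); plt.ylabel("error"); plt.legend(); plt.show()
# Low train error but high CV error -> overfit; both errors high -> underfit
```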
What is time based splitting?
—> works only if we have a timestamp for each data point
Sort Dn in ascending order of time
In datasets like Amazon Fine Food Reviews, time-based splitting is better than random splitting because products and their reviews change over time.
TBS:
| T-90 … T-30 | T-30 … T-15 | T-15 … T |
|-------------|-------------|----------|
| Train (NN)  | CV (find k) | Test     |
- Whenever a timestamp is available and the data/behaviour changes over time, time-based splitting is preferable to random splitting
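A minimal sketch of a time-based split, assuming the data sits in a pandas DataFrame df with a timestamp column named 'time' (the column name and the 60/20/20 ratios are assumptions for illustration):

```python
import pandas as pd

def time_based_split(df, time_col="time"):
    """Sort by timestamp, then take the oldest 60% as Train, the next 20% as CV and the newest 20% as Test."""
    df = df.sort_values(time_col).reset_index(drop=True)
    n = len(df)
    train = df.iloc[: int(0.6 * n)]
    cv    = df.iloc[int(0.6 * n): int(0.8 * n)]
    test  = df.iloc[int(0.8 * n):]
    return train, cv, test
```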