Module 3: 2. Classification and Regression Models: K-Nearest Neighbors Flashcards

1
Q

Explain KNN using an example.

A

Make the drawing in the rough notes.
Given xq ——> find yq ∈ {0, 1}

  1. Find the k nearest points to xq in D
    Let k = 3: {x1, x2, x3} ——> 3-NN to xq, with labels {y1, y2, y3}
  2. yq ——> majority vote over {y1, y2, y3}
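A minimal brute-force sketch of this procedure (function and variable names are illustrative, not from the notes):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_q, k=3):
    """Predict the label of query point x_q by majority vote of its k nearest neighbours.

    X_train: (n, d) numpy array, y_train: (n,) numpy array of labels, x_q: (d,) query point.
    """
    # Euclidean distance from x_q to every training point
    dists = np.linalg.norm(X_train - x_q, axis=1)
    # indices of the k closest points
    nn_idx = np.argsort(dists)[:k]
    # majority vote over their labels {y1, ..., yk}
    return Counter(y_train[nn_idx]).most_common(1)[0][0]
```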
2
Q

What are the failure cases in KNN?

A
  1. xq is far away from all the points in D
  2. The neighbourhood of xq is a jumble of +ve/-ve points (the labels carry no useful information)
3
Q

How to calculate Euclidean distance?

A

x1 = (x11, x12)
x2 = (x21, x22)
d = length of the shortest line from x1 to x2

d = √((x21 - x11)^2 + (x22 - x12)^2) = ||x1 - x2||_2

Euclidean distance:
||x1 - x2||_2 = (Σ_i (x1i - x2i)^2)^(1/2)
||x1||_2 = Euclidean distance of x1 from the origin = (Σ_i x1i^2)^(1/2)

Also,
||x1 - x2||_2 ——> L2-norm of the vector (x1 - x2)
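A quick numpy check of the formula (the example points are made up):

```python
import numpy as np

x1 = np.array([1.0, 2.0])
x2 = np.array([4.0, 6.0])

# ||x1 - x2||_2 via the formula and via numpy's built-in L2 norm
d_manual = np.sqrt(np.sum((x1 - x2) ** 2))   # 5.0
d_numpy = np.linalg.norm(x1 - x2)            # default ord is the L2 norm -> 5.0
```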

4
Q

How to calculate Manhattan distance?

A

Manhattan distance ——> Σ_i |x1i - x2i|
||x1 - x2||_1 ——> L1-norm of the vector (x1 - x2)
||x1||_1 = Σ_i |x1i|

5
Q

How to calculate Minkowski distance?

A

Minkowski distance:
||x1 - x2||_p = (Σ_i |x1i - x2i|^p)^(1/p), for p > 0
p = 2 ——> Minkowski distance ——> Euclidean distance
p = 1 ——> Minkowski distance ——> Manhattan distance

  • A distance is between 2 points, while a norm is always of a single vector
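A small sketch showing that the Minkowski distance reduces to Euclidean and Manhattan for p = 2 and p = 1 (example points are made up):

```python
import numpy as np

def minkowski_distance(x1, x2, p):
    """||x1 - x2||_p = (sum_i |x1i - x2i|^p)^(1/p), for p > 0."""
    return np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p)

x1, x2 = np.array([1.0, 2.0]), np.array([4.0, 6.0])
minkowski_distance(x1, x2, p=2)   # 5.0 -> Euclidean (L2)
minkowski_distance(x1, x2, p=1)   # 7.0 -> Manhattan (L1)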
6
Q

How to calculate Hamming distance?

A

x1 = [ 0, 1, 1, 0, 1, 0, 0 ]
x2 = [ 1, 0, 1, 0, 1, 0, 1 ]
Hamming-distance(x1, x2) = # of locations/dimensions where the binary vectors differ = 3

Strings:
x1 = a, b, c, a, d, e, f, g, h, i, k
x2 = a, c, b, a, d, e, g, f, h, i, k
Hamming-distance(x1, x2) = 4

Gene Code Sequence:
x1 = A A G T C T C A G
x2 = A G A T C T C G A
Hamming-distance(x1, x2) = 4
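A small sketch that reproduces the examples above (the helper name is illustrative):

```python
def hamming_distance(s1, s2):
    """Number of positions at which two equal-length sequences differ."""
    assert len(s1) == len(s2)
    return sum(a != b for a, b in zip(s1, s2))

hamming_distance([0, 1, 1, 0, 1, 0, 0], [1, 0, 1, 0, 1, 0, 1])  # 3
hamming_distance("AAGTCTCAG", "AGATCTCGA")                      # 4
```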

7
Q

Explain Cosine distance and Cosine similarity.

A

Intuitively,
similarity ↓ ——> distance ↑
similarity ↑ ——> distance ↓

By definition,
cos-dist(x1, x2) = 1 - cos-sim(x1, x2)

cos-sim(x1, x2) = cos θ, where θ is the angle between x1 and x2
cos-dist(x1, x2) = 1 - cos θ

Draw in rough notes (x3 lies along the same direction as x1 but farther away; x2 is at angle θ from x1):
cos-sim(x1, x3) = 1 (since the angle between x1 and x3 is 0 degrees)
cos-dist(x1, x3) = 1 - 1 = 0

With d = Euclidean distance: d13 > d12, but cos-dist13 < cos-dist12
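A short numpy sketch of cosine similarity/distance illustrating the same point (the vectors are made-up examples where x3 points in the same direction as x1):

```python
import numpy as np

def cosine_similarity(x1, x2):
    """cos(theta) between x1 and x2; cosine distance is 1 - cos(theta)."""
    return np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))

x1 = np.array([1.0, 0.0])
x3 = np.array([3.0, 0.0])     # same direction as x1, farther away (Euclidean d13 = 2)
x2 = np.array([0.0, 1.0])     # 90 degrees from x1 (Euclidean d12 ~ 1.41)

cosine_similarity(x1, x3)     # 1.0 -> cos-dist = 0, even though d13 > d12
cosine_similarity(x1, x2)     # 0.0 -> cos-dist = 1
```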

8
Q

If x1 and x2 are unit vectors and θ is the angle between x1 and x2, then what is the relationship between the Euclidean distance and the cosine similarity of x1 and x2?

A

We know,
cos θ = (x1 · x2) / (||x1|| ||x2||)
If x1 and x2 are unit vectors,
||x1|| = ||x2|| = 1
Therefore,
cos θ = x1 · x2

Euclidean-distance(x1, x2)^2 = ||x1 - x2||^2 = ||x1||^2 + ||x2||^2 - 2 (x1 · x2)
                             = 2 * (1 - cos θ)
                             = 2 * cos-dist(x1, x2)
i.e. Euclidean-distance(x1, x2) = √(2 * (1 - cos θ))
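A quick numerical check of this relation for unit vectors (the angle is arbitrary):

```python
import numpy as np

theta = np.deg2rad(40.0)
x1 = np.array([1.0, 0.0])                       # unit vector
x2 = np.array([np.cos(theta), np.sin(theta)])   # unit vector at angle theta from x1

lhs = np.linalg.norm(x1 - x2) ** 2              # squared Euclidean distance
rhs = 2 * (1 - np.dot(x1, x2))                  # 2 * (1 - cos theta)
np.isclose(lhs, rhs)                            # True
```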

9
Q

What is the time and space complexity of KNN?

A

Time complexity ——> O(nd) (to find the nearest neighbours of a single query point)
Space complexity ——> O(nd) (the whole training set must be stored)
Both are A LOT for large n and d.

10
Q

What are the limitations of KNN?

A
  1. Space complexity
    Eg. n ≈ 364k and d ≈ 100k ——> 36,400M values ≈ 36 GB (assuming ~1 byte per value) ——> needs a machine with 64 GB RAM
    ↑ space complexity
  2. Time complexity
    ≈ 36 billion computations per query ——> a few seconds
    Therefore KNN can't be used in low-latency applications
11
Q

Draw the decision surfaces for KNN with different values of k and also talk about the impact of those values.

A

Draw in rough notes
1. K = 1
- Non-smooth decision surface
- Overfitting
2. K = 5
- Smoother decision surface
3. K = n
- Underfitting
- Every query point gets the majority class of the whole dataset

12
Q

What is the need for cross validation?

A

k = 1 (overfit), 2, 3, ……, n (underfit) ——> the optimal value of k will be somewhere in the middle

One idea: try different values of k, train on Dtrain, and check accuracy on Dtest.
Draw the accuracy vs k graph.
Say k = 6 gives the best accuracy (96%) on Dtest when using Dtrain as the training data.

Small problem: the sole objective of ML is to perform well on UNSEEN data ——> generalization.
Since k itself was tuned on Dtest, we can't claim the model's accuracy on unseen data is 96%.

Cross Validation:
Dn ——> Dtrain ——> compute NN (60%)
   ——> Dcv    ——> used to determine the right 'k' value (20%)
   ——> Dtest  ——> apply f to get accuracy on unseen data (20%)
Now, if Dtest gives 93% accuracy with the optimal 'k' found on Dcv, we can say the model's accuracy on unseen data is 93%.
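A sketch of this train/CV/test procedure using scikit-learn, assuming a feature matrix X and labels y are already loaded (the 60/20/20 split and the candidate k values are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 60% train, 20% CV, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# pick k on D_cv, never on D_test
best_k, best_acc = None, -1.0
for k in range(1, 30, 2):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_cv, y_cv)
    if acc > best_acc:
        best_k, best_acc = k, acc

# report generalization accuracy once, on unseen D_test
test_acc = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train).score(X_test, y_test)
```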

13
Q

What is K-fold cross validation?

A

We saw in Cross Validation:
Dn ——> Dtrain ——> (60%)
   ——> Dcv    ——> (20%)
   ——> Dtest  ——> (20%)
Problem: only 60% of the data is used for NN ——> is there a way to use 80% of the data to compute NN? (more training data generally means a better model)

K-fold CV:
1. Dn ——> Dtrain (80%) ——> for finding k and computing NN
      ——> Dtest (20%)  ——> unseen data
2. Split Dtrain randomly into 4 equal parts [ D1, D2, D3, D4 ] ——> each is 20% of the data
3. Here K' = 4 ——> 4-fold CV

|       | Train    | CV | Accuracy on CV |
|-------|----------|----|----------------|
| k = 1 | D1 D2 D3 | D4 | a4             |
| k = 1 | D1 D2 D4 | D3 | a3             |
| k = 1 | D1 D3 D4 | D2 | a2             |
| k = 1 | D2 D3 D4 | D1 | a1             |
| k = 2 | D1 D2 D3 | D4 | a4'            |
| k = 2 | D1 D2 D4 | D3 | a3'            |
| k = 2 | D1 D3 D4 | D2 | a2'            |
| k = 2 | D2 D3 D4 | D1 | a1'            |
| ...   | ...      | ...| ...            |

avg(a1, a2, a3, a4) = a_{k=1} ——> average CV accuracy of K-NN for k = 1

Let a_{k=3} be the best, i.e. 3-NN has the best average accuracy over the 4-fold CV ——> train NN on Dtrain with k = 3 ——> then check on Dtest ——> e.g. 93% accuracy on unseen data

What is the right K'?
Rule of thumb ——> K' = 10 ——> 10-fold CV

The time it takes to find the optimal/best k for K-NN increases by a factor of K' if we use K'-fold CV.
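A sketch of 10-fold CV over k using scikit-learn's GridSearchCV, again assuming X and y are already loaded (the candidate k values are illustrative):

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 10-fold CV over candidate k values; each k is scored by its average CV accuracy
grid = GridSearchCV(KNeighborsClassifier(), param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=10)
grid.fit(X_train, y_train)

grid.best_params_            # best k by average accuracy across the 10 folds
grid.score(X_test, y_test)   # accuracy on unseen D_test
```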

14
Q

Given a value of ‘k’ how to determine whether it underfits or overfits?

A

Graphical Method:
Accuracy = (# correctly classified points) / (total # of points)
Error = 1 - accuracy
Draw the graph in rough notes: plot train error and CV error vs k. High train error and high CV error ——> underfit; very low train error but high CV error ——> overfit.

15
Q

What is time based splitting?

A

—> works only if we have a timestamp

Sort Dn in ascending order of time.
In datasets like Amazon food reviews, time-based splitting is better than random splitting because products and their reviews change with time.

TBS (oldest ——> newest):
| T-90 to T-30 | T-30 to T-15 | T-15 to T |
|--------------|--------------|-----------|
| Train (NN)   | CV (pick k)  | Test      |

  • Whenever time is available, and things/behaviour/data change over time, time-based splitting is preferable to random splitting
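A sketch of a time-based split with pandas, assuming a DataFrame df with a 'timestamp' column (column name and 60/20/20 proportions are illustrative):

```python
import pandas as pd

# assume df holds the features, label, and a 'timestamp' column
df = df.sort_values("timestamp")

n = len(df)
train = df.iloc[: int(0.6 * n)]               # oldest 60% -> fit NN
cv    = df.iloc[int(0.6 * n): int(0.8 * n)]   # next 20%   -> choose k
test  = df.iloc[int(0.8 * n):]                # newest 20% -> unseen data
```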
16
Q

Can we use K-NN for regression? If yes, how?

A

YES
1. Given xq, find the k nearest neighbours ——> (x1, y1), (x2, y2), …, (xk, yk)
2. yq <——— aggregate of y1, y2, …, yk ∈ R
   yq = mean(yi)
   OR
   yq = median(yi)
——> simple extension/modification of classification
——> the median is less prone to outliers
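A minimal sketch of K-NN regression (names are illustrative):

```python
import numpy as np

def knn_regress(X_train, y_train, x_q, k=5, use_median=False):
    """Predict a real-valued y_q as the mean (or median) of the k nearest neighbours' targets."""
    dists = np.linalg.norm(X_train - x_q, axis=1)   # distance to every training point
    nn_idx = np.argsort(dists)[:k]                  # indices of the k nearest neighbours
    return np.median(y_train[nn_idx]) if use_median else np.mean(y_train[nn_idx])
```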

17
Q

Explain weighted K-NN.

A

Eg. 5-NN where k = 5
2 neighbours ——> -ve and 3 ——> +ve

|        | di       | label | wi = 1/di |
|--------|----------|-------|-----------|
| x1, y1 | d1 = 0.1 | -ve   | 10        |
| x2, y2 | d2 = 0.2 | -ve   | 5         |
| x3, y3 | d3 = 1.0 | +ve   | 1         |
| x4, y4 | d4 = 2.0 | +ve   | 0.5       |
| x5, y5 | d5 = 4.0 | +ve   | 0.25      |

wi = 1/di ——> simplest weight function
As di ↑, wi ↓; as di ↓, wi ↑
Weighted vote: -ve gets 10 + 5 = 15, +ve gets 1 + 0.5 + 0.25 = 1.75 ——> predict -ve, even though the plain majority vote says +ve
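A sketch of the weighted vote with wi = 1/di, reproducing the example above (function name is illustrative):

```python
from collections import defaultdict

def weighted_knn_predict(dists, labels):
    """Weighted vote: each of the k neighbours votes with weight w_i = 1 / d_i."""
    votes = defaultdict(float)
    for d, y in zip(dists, labels):
        votes[y] += 1.0 / d
    return max(votes, key=votes.get)

# example from the card: plain majority vote says +ve, but the two closest points are -ve
weighted_knn_predict([0.1, 0.2, 1.0, 2.0, 4.0], ["-ve", "-ve", "+ve", "+ve", "+ve"])  # '-ve'
```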

18
Q

Explain Voronoi Diagram.

A

A Voronoi diagram is a partition of the plane into regions, each consisting of the points closest to one of a given set of objects (for 1-NN, the training points).
Draw in rough notes

19
Q

Explain Binary Search Tree. Is it better than K-NN? If yes, why?

A

K-NN ——> time complexity ——> O(n) ——> if d and k are small
     ——> space complexity ——> O(n)
Is there a way to bring the time complexity down from O(n) to O(log n)?
Eg. n = 1024 ——> log2(n) = 10

Binary Search Tree (BST):
Problem: given a sorted array, find whether a number is present in the array or not.
1. Construct a BST
2. Search for the number by following the BST's comparison conditions at each node

BST time complexity ——> O(log n) ——> depth of the tree

How to build a KD-Tree, which is basically a BST generalized to k-dimensional points (KD-Tree is short for k-dimensional tree):
1. Pick the x-axis
2. Project the points onto the x-axis
3. Compute the median
4. Split the data at the median
5. Alternate to the next axis and repeat the same steps
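A sketch using scikit-learn's KDTree for nearest-neighbour queries (the random data and dimensions are illustrative):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))      # KD-Trees help mainly when d is small

tree = KDTree(X)                    # build the tree once
x_q = rng.normal(size=(1, 3))       # query point (2-D array: one row)
dist, idx = tree.query(x_q, k=3)    # 3-NN query, ~O(log n) per query for small d
```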

20
Q

What are the limitations of KD-Tree?

A

Limitations:
1. When d is not small (i.e. not 2, 3, 4, 5):
   d = 10 ——> 2^d = 1024 adjacent cells
   d = 20 ——> 2^d ≈ 1 million adjacent cells
   When d is 10, 11, 12, ….. the time complexity increases dramatically:
   it is no longer O(log n) but becomes O(2^d * log n).
   If d is large (eg. 2^d ≈ n) then it becomes O(n log n) ——> worse than brute force.
2. Even when d is small, O(log n) holds only when the data is roughly uniformly distributed; otherwise it moves towards O(n).

21
Q

Explain Hashing vs Locality Sensitive Hashing (LSH) with example.

A

Eg. array a = [ 2, 1, 5, 6, 7, 5, 8 ] ——> unordered
Q. Find if 5 exists in a.
Hashing:
x ——> h(x), i.e. a hash function ——> maps x to the bucket where it is stored
With sequential/linear search the time complexity would be O(n).
With hashing the lookup time complexity ——> O(1) on average.

Locality Sensitive Hashing (LSH):
——> a type of hashing used to find the points in the same neighbourhood as xq, given xq
Eg. xq ——> h(xq) ——> x1, x2, x4, x7, …. (returns points in the same neighbourhood)
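A tiny sketch contrasting linear search with a hash-based lookup on the same array:

```python
a = [2, 1, 5, 6, 7, 5, 8]

# linear search: O(n)
found_linear = any(v == 5 for v in a)

# hashing: build a hash-based set once, then membership checks are ~O(1) on average
hashed = set(a)
found_hashed = 5 in hashed
```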

22
Q

Explain LSH for cosine similarity.

A

Draw diagram in rough notes
cos-sim(x1, x2) = cos θ
cos-sim(x1, x3) = cos 0 = 1
As θ ↑, x1 and x2 become more angularly separated ——> distance ↑ ——> cos θ ↓ ——> similarity ↓
cosine-similarity ≈ angular similarity

Draw diagram in rough notes
θ12 ≈ 0 ——> x1 and x2 are similar/close
h(x1) = h(x2) = key ——> both are in the same neighbourhood/bucket

LSH ——> randomized algorithm
——> will not always give you the correct answer
——> tries to give the correct answer with high probability

Steps:
1. Draw a random hyperplane (π1):
   w1^T x = 0 ——> w1: normal to the plane
   w1 is a d-dim random vector (indexed from 0 to d-1)
   w1 = numpy.random.normal(0, 1, d)
   ** w1^T x1 >= 0 and w1^T x2 <= 0 ——> x1 is on the +ve side of the plane and x2 on the -ve side
2. Say we draw two more such planes, with normals w2 and w3 (draw diagram in rough notes):
   m = # of hyperplanes
   Each point gets a hash value based on which side of each plane it lies on, eg. h(x7) = (+1, -1, -1)
   Here +1 = sign(w1^T x7), -1 = sign(w2^T x7), -1 = sign(w3^T x7)

To compute all of this,
Time complexity ~ O(mdn)
Space complexity ~ O(nd)

Given a query point xq,
1. Calculate h(xq), i.e. the hash value as shown earlier using the signs
2. Use h(xq) as the key into the hash table
   ——> d.get(h(xq)) ——> returns eg. [x1, x2, x4, x7]

Time complexity for querying ~ O(md + n'd) ——> n' = # of elements in the bucket
If n' is small or n' ≈ m then ——> O(md)
Typically m ≈ log(n) ——> m = # of hyperplanes
Therefore querying becomes O(d log n)
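A minimal sketch of this sign-based random-hyperplane hashing (function names are illustrative; it builds the hash table and answers a query with the candidate bucket):

```python
import numpy as np
from collections import defaultdict

def build_lsh_table(X, m, seed=0):
    """Hash each point by the signs of its projections onto m random hyperplanes."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 1, size=(m, X.shape[1]))        # m random normals w1..wm
    table = defaultdict(list)
    for i, x in enumerate(X):
        key = tuple(np.sign(W @ x).astype(int))       # e.g. (+1, -1, -1)
        table[key].append(i)
    return W, table

def lsh_query(x_q, W, table):
    """Return indices of points whose hash key matches h(x_q) (candidate neighbours)."""
    return table.get(tuple(np.sign(W @ x_q).astype(int)), [])
```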

23
Q

What is the problem with LSH with cos-sim? How to overcome it?

A
  • With cos-sim as your measure, LSH could miss a true NN: a nearest neighbour of xq may fall on the other side of one of the random hyperplanes and therefore land in a different bucket.
  • Remedy: repeat the hashing with several independent sets of m random hyperplanes (i.e. build multiple hash tables) and take the union of the retrieved buckets; this increases the probability that a true NN shares a bucket with xq in at least one table.