Clustering Flashcards

1
Q

What is clustering?

A

Taking a set of data points and putting them into groups so that the points within each group are close to, or similar to, each other.

2
Q

Euclidean distance (straight line)

A

sqrt((x1-y1)^2 + (x2-y2)^2)

3
Q

Rectilinear distance

A

abs(x1-y1) + abs(x2-y2)
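
As a quick check of the straight-line and rectilinear formulas, here is a minimal Python sketch for two-dimensional points. The function names and the sample points are illustrative assumptions, not part of the original cards.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two 2-D points."""
    return math.sqrt((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2)

def rectilinear(x, y):
    """Rectilinear distance: sum of the absolute coordinate differences."""
    return abs(x[0] - y[0]) + abs(x[1] - y[1])

# Hypothetical sample points (a 3-4-5 right triangle).
a, b = (0, 0), (3, 4)
print(euclidean(a, b))    # 5.0
print(rectilinear(a, b))  # 7
```

The rectilinear distance is never smaller than the straight-line distance, since it follows the two legs of the triangle rather than the hypotenuse.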

4
Q

p-norm distance (Minkowski distance)

A

pthroot(abs(x1-y1)^p + abs(x2-y2)^p)
or
pthroot(sum for i = 1 to n of abs(xi-yi)^p)
where p is 2 for straight line
and 1 for rectilinear

Generalized to any number of dimensions, it's the sum over all n dimensions.
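
The general formula can be sketched in a few lines of Python. This is a minimal illustration assuming points are given as equal-length sequences of numbers; the name `minkowski` and the sample points are made up for the example.

```python
def minkowski(x, y, p):
    """p-norm distance between two equal-length points (assumes p >= 1)."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    # Sum |xi - yi|^p over every dimension, then take the p-th root.
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

# Hypothetical 3-D points.
a, b = (0, 0, 0), (3, 4, 12)
print(minkowski(a, b, 1))  # rectilinear: 3 + 4 + 12
print(minkowski(a, b, 2))  # straight line: sqrt(9 + 16 + 144)
```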

5
Q

infinity norm distance

A

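Start from the p-norm distance, pthroot(sum for i = 1 to n of abs(xi-yi)^p), and let p go to infinity.

As p grows, the largest term dominates the sum, so the distance is approximately pthroot(max over i of abs(xi-yi)^p).

That equals max over i of abs(xi-yi), because the pth power and the pth root cancel.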
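
A small numerical sketch can make the limit concrete: as p grows, the p-norm distance approaches the largest absolute coordinate difference. The function names and sample points below are assumptions chosen for illustration.

```python
def minkowski(x, y, p):
    """p-norm distance between two equal-length points."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

def infinity_norm_distance(x, y):
    """Largest absolute coordinate difference between two points."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

# Hypothetical points whose coordinate differences are 3, 4 and 0.
a, b = (1, 5, 2), (4, 1, 2)
for p in (1, 2, 10, 100):
    print(p, minkowski(a, b, p))
print("inf", infinity_norm_distance(a, b))  # 4
```

The printed p-norm values shrink toward the infinity-norm value as p increases, because the term with the largest difference swamps the others.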

6
Q

What is the infinity norm?

A

The largest absolute value in a set of numbers.

7
Q

Why would you use infinity norm?

A

When the result is governed by the single largest component. For example, how long does a task with multiple simultaneous steps take? As long as its longest step.

8
Q

clustering workflow

A

  1. Pick k cluster centers within the range of the data.
  2. Assign each data point to the nearest cluster center.
  3. Recalculate the cluster centers.
  4. Repeat steps 2 and 3 until no data point changes groups and therefore the cluster centers don't change.

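The workflow above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes straight-line distance, assumes the initial centers are k randomly chosen data points (one way to stay within the range of the data), and uses made-up sample points.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points as the initial centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the index of its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed groups, so the centers are final
        assignment = new_assignment
        # Recalculate each center as the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

# Hypothetical data: two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, assignment = kmeans(points, k=2)
print(centers)
print(assignment)
```

The `max_iter` cap is only a safety net; on small data like this the loop stops as soon as the assignments repeat.
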
9
Q

heuristic

A

An algorithm that's not guaranteed to find the best solution.

10
Q

expectation-maximization (like clustering)

A

An iterative procedure that alternates between taking an expectation (finding the cluster centers, each being the mean of its points) and maximizing (assigning each point to its nearest cluster center).

11
Q

k-means algorithm

A

-heuristic
-run several times with different initial cluster centers and choose the best solution you find
-run with different values of k (# of clusters)

THEN
compare the total distance from each point to its cluster center against the # of clusters and look for the elbow, which represents diminishing returns
-consider qualitative aspects as well, if desired
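
The restart-and-compare procedure above might look like the following sketch. It assumes straight-line distance, random data points as starting centers, and "total distance" meaning the sum of each point's distance to its assigned center; the helper names and the sample data are hypothetical.

```python
import math
import random

def kmeans_total_distance(points, k, seed, max_iter=100):
    """Run k-means once and return the total point-to-center distance."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k data points
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(math.dist(p, centers[a]) for p, a in zip(points, assignment))

def best_total(points, k, restarts=10):
    """Best (smallest) total distance over several different starts."""
    return min(kmeans_total_distance(points, k, seed) for seed in range(restarts))

# Hypothetical data with three visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (1, 9), (2, 9)]
totals = {k: best_total(points, k) for k in range(1, 6)}
for k, total in totals.items():
    print(k, round(total, 2))
```

Plotting or simply reading the printed totals against k is how the elbow would be spotted; the code does not pick k automatically, since that judgment can also depend on qualitative considerations.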

12
Q

Predictive clustering

A

If a new data point falls within a cluster, we can assign it to that cluster. If it falls outside every cluster, we can assign it to the closest cluster center.
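
Assigning a new point to the closest cluster center can be sketched as below. The centers are hypothetical values standing in for the output of an earlier clustering run, and straight-line distance is an assumption.

```python
import math

def predict_cluster(point, centers):
    """Index of the cluster center closest to `point`."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

# Hypothetical centers from a previous clustering run.
centers = [(1.0, 1.0), (8.0, 8.0)]
print(predict_cluster((2, 3), centers))  # 0
print(predict_cluster((7, 9), centers))  # 1
```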

13
Q

Voronoi diagram

A

The region of space around each cluster center in which we would predict a point to be part of that cluster, based on distance.

14
Q

classification vs clustering

A

classification - we know the response variable for the data points; this is supervised learning
clustering - we don’t know the response; this is unsupervised learning

15
Q

What is clustering useful for?

A

-targeted marketing
-personalized medicine
-physical distance (libraries, police stations, branches)
-image analysis
