Clustering Flashcards
What is clustering
taking a set of datapoints and putting them into groups so that each group has points that are close to or similar to each other
Euclidean distance (straight line)
sqrt((x1-y1)^2 + (x2-y2)^2)
Rectilinear distance
abs(x1-y1) + abs(x2-y2)
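A minimal sketch of the two 2-D formulas above, to make the difference concrete. The function names and the sample points are made up for illustration.

```python
import math

def euclidean_2d(x, y):
    """Straight-line distance between 2-D points x and y."""
    return math.sqrt((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2)

def rectilinear_2d(x, y):
    """Grid (city-block) distance between 2-D points x and y."""
    return abs(x[0] - y[0]) + abs(x[1] - y[1])

a, b = (0, 0), (3, 4)        # illustrative points
print(euclidean_2d(a, b))    # 5.0
print(rectilinear_2d(a, b))  # 7
```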
p-norm distance (Minkowski distance)
pthroot(abs(x1-y1)^p + abs(x2-y2)^p)
or
pthroot(sum i = 1 to n of abs(xi-yi)^p)
where p is 2 for straight line
and 1 for rectilinear
generalized to any dimension, it's the sum over all n dimensions
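The general n-dimensional formula above could be sketched like this. The name `minkowski` and the sample points are assumptions for illustration, not from any particular library.

```python
def minkowski(x, y, p):
    """p-norm distance between two points with the same number of dimensions."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    # sum over all n dimensions, then take the p-th root
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

a, b = (0, 0, 0), (1, 2, 2)  # illustrative 3-D points
print(minkowski(a, b, 1))    # 5.0 (p = 1: the 1-norm)
print(minkowski(a, b, 2))    # 3.0 (p = 2: straight line)
```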
infinity norm distance
pthroot(sum i = 1 to n of abs(xi-yi)^p) as p goes to infinity
as p grows the largest term dominates the sum, so it's approximately equal to inf root(max i abs(xi-yi)^inf)
it equals max i abs(xi-yi) because the inf power and the inf root cancel
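A quick numeric check of the limit argument above: as p grows, the p-norm distance should shrink toward the largest single coordinate difference. The helper names and the points are illustrative only, and integer coordinates are used so large powers don't overflow.

```python
def minkowski(x, y, p):
    """p-norm distance between two points of equal dimension."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

def infinity_norm_distance(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

a, b = (0, 0, 0), (1, 2, 4)  # illustrative integer points
for p in (1, 2, 10, 100):
    print(p, minkowski(a, b, p))   # values approach 4 as p grows
print("inf", infinity_norm_distance(a, b))  # inf 4
```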
what is the infinity norm
the largest absolute value in a set of numbers
Why would you use infinity norm?
how long does it take to do something with multiple simultaneous steps? whatever the maximum time length thing is! duh!
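A tiny example of that idea, with made-up step names and durations: when the steps all run at the same time, the total time is just the longest one.

```python
# Hypothetical durations (minutes) of steps that all run at the same time.
step_minutes = {"preheat oven": 10, "chop vegetables": 7, "boil water": 12}

# Total time for simultaneous steps = the longest single step,
# i.e. the infinity norm of the duration vector.
total = max(abs(t) for t in step_minutes.values())
print(total)  # 12
```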
clustering workflow
- pick k cluster centers within range of data
1. assign each data point to nearest cluster center
2. recalculate cluster centers
- repeat 1 then 2 until no datapoint changes groups and therefore the cluster centers don’t change
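The workflow above could be sketched roughly as follows. This is only one way to do it. It assumes straight-line distance, assumes the starting centers are k randomly chosen data points (one way to stay within range of the data), and uses an illustrative function name `kmeans` with a `max_iter` safety cap that is not part of the workflow itself.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Minimal k-means sketch: returns (centers, labels)."""
    rng = random.Random(seed)
    # pick k starting centers; assumption: use k distinct data points
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # step 1: assign each point to its nearest center
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # stop when no point changed groups
        if new_labels == labels:
            break
        labels = new_labels
        # step 2: recalculate each center as the mean of its points
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # toy data
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

Which group gets label 0 depends on the random starting centers, so only the grouping is meaningful, not the label numbers.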
heuristic
an algorithm that’s not guaranteed to find the best solution
expectation-maximization (like clustering)
an iterative procedure that alternates between taking an expectation (finding cluster centers) and maximizing (assigning points to the clusters)
k-means algorithm
-heuristic
-run several times with different initial cluster centers and choose the best solution you find
-run with different values of k (# of clusters)
THEN
compare the total distance (each point to its cluster center) to the # of clusters and look for the elbow, which represents diminishing returns
-important to consider qualitative aspects as well (if we want to)
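A rough sketch of the restart-and-elbow idea above, on toy data with three obvious groups. The function names, the number of restarts, and the data are all assumptions for illustration. "Total distance" here means the sum of straight-line distances from each point to its assigned center.

```python
import math
import random

def kmeans_total_distance(points, k, rng, max_iter=100):
    """One k-means run from random starting centers; returns total distance."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(math.dist(p, centers[lab]) for p, lab in zip(points, labels))

def best_of_restarts(points, k, restarts=10, seed=0):
    """Run several times with different starting centers; keep the best."""
    rng = random.Random(seed)
    return min(kmeans_total_distance(points, k, rng) for _ in range(restarts))

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9),
          (1, 9), (2, 9), (1, 8)]  # toy data: three tight groups
for k in range(1, 6):
    print(k, round(best_of_restarts(points, k), 2))
```

On data like this the total distance should drop sharply up to k = 3 and only slightly after that, which is the elbow.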
Predictive clustering
if a new datapoint falls within a cluster we can assign it to that cluster. if it falls outside of a cluster we can assign it to the closest cluster center
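A simplified sketch of the assignment rule above: it always picks the closest cluster center, which covers both cases when clusters are defined by distance to their centers. The centers below are hypothetical, as if they came from an earlier clustering run.

```python
import math

def predict_cluster(point, centers):
    """Index of the cluster center closest to `point`."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

centers = [(1.0, 1.0), (8.0, 8.0)]       # hypothetical fitted centers
print(predict_cluster((2, 3), centers))  # 0
print(predict_cluster((7, 9), centers))  # 1
```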
Voronoi diagram
basically just the space around each cluster center where we would predict a point to be a part of that cluster, based on distance
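One way to see those regions is to label every cell of a small grid with its nearest center and print the result as text. The centers, grid size, and letter symbols here are made up for illustration.

```python
import math

def nearest_center(point, centers):
    """Index of the center closest to `point` (straight-line distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def voronoi_rows(centers, width, height, symbols="ABC"):
    """Text picture of the regions; the first row is the top (largest y)."""
    rows = []
    for y in range(height - 1, -1, -1):
        rows.append("".join(symbols[nearest_center((x, y), centers)]
                            for x in range(width)))
    return rows

centers = [(2, 2), (9, 3), (5, 8)]  # hypothetical cluster centers
rows = voronoi_rows(centers, width=12, height=11)
print("\n".join(rows))
```

Each letter marks the region where a new point would be predicted to belong to that center's cluster.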
classification vs clustering
classification - we know response variable, this is supervised learning
clustering - we don’t know the response, this is unsupervised learning
What is clustering useful for
-targeted marketing
-personalized medicine
-physical distance (libraries, police stations, branches!)
-image analysis