Chapter 9: Cluster Analysis Flashcards
what is cluster analysis
unsupervised
task of grouping a set of objects so objects in the same group are similar
give the 3 similarity measures
Euclidean distance
cosine similarity
Minkowski distance
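A minimal sketch of the three measures in plain Python, assuming points are equal-length tuples of numbers. The function names are illustrative only. Note that Euclidean and Minkowski are distances (smaller means more similar), while cosine similarity is larger for more similar vectors.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def euclidean(x, y):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```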
what are the key tasks of cluster analysis
define distance measure
identify cluster number
perform grouping
evaluate
what are the hyperparameters in cluster analysis
distance measure
cluster number
what is k means clustering
a clustering algorithm that calculates the distance from each point to K centre points
assigns each point to its nearest centre
updates each centre using the average of its cluster's points, repeating until assignments stop changing
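The assign-then-update loop can be sketched as follows. This is an illustrative version only: it assumes squared Euclidean distance, initial centres supplied by the caller, and a fixed number of iterations instead of a convergence check. The name kmeans and its parameters are hypothetical.

```python
def kmeans(points, centres, iterations=10):
    """Basic k-means on tuples; `centres` holds the K initial centres."""
    centres = [tuple(c) for c in centres]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centre.
        labels = [
            min(range(len(centres)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centres[k])))
            for p in points
        ]
        # Update step: each centre becomes the mean of its assigned points.
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centre if a cluster is empty
                centres[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centres, labels
```

Each iteration compares every point with every centre, which is where the cost per iteration comes from.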
within each cluster, what is minimised
within-cluster sum of squared distances from each point to its cluster centre
what is the running time of k means clustering
O(TKN), where T = number of iterations, K = number of clusters, N = number of points
what are the drawbacks of k means clustering (3)
doesn’t cope well with noise or outliers
need to decide number of clusters
not suitable for complex patterns
what does the distance between clusters tell us
how similar two clusters are (smaller distance = more similar)
what is single link measure
distance between clusters = minimum distance
what is complete link measure
distance between clusters = maximum distance
what is average link measure
distance between clusters = average distance
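A small sketch of the three link measures, assuming each cluster is a list of coordinate tuples and Euclidean distance between points. The maximum-distance version is usually called complete link. Function names are illustrative.

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """All distances between a point in cluster_a and a point in cluster_b."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_link(cluster_a, cluster_b):
    # Distance between the two closest points.
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_link(cluster_a, cluster_b):
    # Distance between the two farthest points.
    return max(pairwise_distances(cluster_a, cluster_b))

def average_link(cluster_a, cluster_b):
    # Mean of all pairwise distances.
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)
```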
describe hierarchical clustering
objects grouped in a tree structure
what is agglomerative clustering
start with atomic clusters and merge until you get one big cluster
what is divisive clustering
start as one big cluster and separate out to atomic clusters
what is a dendrogram
a tree diagram with the data points as leaves that shows the distance at which clusters were merged
what is the lifetime of a cluster
difference between the distance at which it was created and the distance at which it was merged into another cluster
how do we get k clusters from hierarchical clustering
cut the tree
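A rough sketch of agglomerative clustering that stops once k clusters remain, which gives the same grouping as building the full tree and cutting it at the level with k clusters. It assumes single link and Euclidean distance. The name agglomerative is hypothetical, and the brute-force search is for illustration rather than efficiency.

```python
import math

def agglomerative(points, k):
    """Merge the two closest clusters (single link) until k clusters remain."""
    clusters = [[p] for p in points]  # start with atomic clusters
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters
```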
what is cluster validation
check the clusters make logical sense
what are the two methods of cluster validation
internal and external criteria
what pattern of variation do we want
a good clustering should have small within-cluster and large between-cluster variation
how do we calculate within cluster variance
sum for each point in cluster: d^2(point, cluster centre)
how do we calculate between cluster variance
sum for each cluster: no_points_in_cluster * d^2(cluster centre, data centre)
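Both quantities can be sketched in a few lines, assuming squared Euclidean distance in both sums, cluster centres taken as the mean of the cluster's points, and the data centre taken as the mean of all points. Function names are illustrative.

```python
def centroid(points):
    """Mean of a list of coordinate tuples."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))

def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def within_cluster_variation(clusters):
    # Sum over every point of its squared distance to its own cluster centre.
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)

def between_cluster_variation(clusters):
    # Sum over clusters of size * squared distance from cluster centre to data centre.
    data_centre = centroid([p for c in clusters for p in c])
    return sum(len(c) * sq_dist(centroid(c), data_centre) for c in clusters)
```

For two tight, well-separated clusters the first value is small and the second is large.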
what is external validation
validate against ground truth labels
how do we evaluate against ground truth labels
rand index
describe rand index
compare cluster ID to class ID
agreement / disagreement table
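rand = (a + d) / (total number of pairs), where a = pairs in the same cluster and same class, d = pairs in different clusters and different classes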
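A minimal sketch of the Rand index computed by counting pairs of points, assuming cluster IDs and class IDs are given as two equal-length lists. A pair counts as an agreement when the two labelings both put it together or both keep it apart. The function name is illustrative.

```python
from itertools import combinations

def rand_index(cluster_ids, class_ids):
    """Fraction of point pairs on which the clustering and the classes agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(cluster_ids)), 2):
        same_cluster = cluster_ids[i] == cluster_ids[j]
        same_class = class_ids[i] == class_ids[j]
        if same_cluster == same_class:  # both together, or both apart
            agree += 1
        total += 1
    return agree / total
```

The actual ID values do not matter, only which points share an ID, so a clustering that matches the classes under different names still scores 1.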