Clustering Flashcards
Explain the steps taken in the K-means algorithm
1. Randomly generate k initial centroids.
2. Compute the distance from each sample to every centroid.
3. Assign each sample to its closest centroid.
4. Recompute each centroid as the mean of its assigned samples.
5. If the assignments have not converged, go back to step 2.
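The steps above can be sketched in a few lines of numpy (the function name and the choice of initialising centroids by sampling data points are illustrative, not from the notes):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X has shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k distinct samples as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: distance from each sample to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each sample to its closest centroid.
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```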
What are the pros and cons of k-means clustering?
Pros
- Relatively simple to implement
- Scales to large datasets
- Guaranteed to converge (to a local optimum)
Cons
- User has to define k
- Depends on initial values
- Cluster boundaries lie midway between centers
- Cannot model covariances well
What are the two types of hierarchical clustering? Define them
Agglomerative (bottom-up): start with each sample as its own cluster, then iteratively merge the two clusters with the smallest between-group dissimilarity
Divisive (top-down): start with one cluster containing all the data, then iteratively split the cluster with the largest between-group dissimilarity
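A minimal pure-numpy sketch of the agglomerative variant, assuming single linkage (minimum pairwise distance) as the between-group dissimilarity; the function name is illustrative:

```python
import numpy as np

def agglomerative(X, n_clusters):
    """Bottom-up single-linkage sketch: start with each sample as its
    own cluster, repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest single-linkage
        # (minimum pairwise) distance between them.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair into one cluster.
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for lab, members in enumerate(clusters):
        labels[members] = lab
    return labels
```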
Define mixture models
They represent p(x) as a weighted sum (mixture) of component distributions. Expectation-Maximisation (EM) can then be used to find (local) maximum-likelihood estimates of the parameters.
Write the formula for “responsibility” in the context of mixture models
For a mixture p(x) = sum_k pi_k N(x | mu_k, Sigma_k), the responsibility of component k for sample x_n is
r_nk = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
i.e. the posterior probability that component k generated x_n, computed in the E-step of EM.
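A numpy sketch of computing responsibilities in the E-step of a one-dimensional Gaussian mixture (the function name and the parameterisation by per-component weights, means, and standard deviations are illustrative assumptions):

```python
import numpy as np

def responsibilities(x, weights, means, stds):
    """E-step sketch for a 1-D Gaussian mixture: posterior probability
    of each component for each sample. Parameters are illustrative:
    weights (pi_k), means (mu_k), stds (sigma_k), one entry per component."""
    x = np.asarray(x, dtype=float)[:, None]  # shape (n, 1) for broadcasting
    # Gaussian densities N(x_n | mu_k, sigma_k) for every sample/component pair.
    dens = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    weighted = weights * dens                # pi_k * N(x_n | mu_k, sigma_k)
    # Normalise over components so each row sums to 1.
    return weighted / weighted.sum(axis=1, keepdims=True)
```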
What is the silhouette coefficient? What is it used for?
s_i = (b-a) / {max(a, b)}
a: mean distance between sample i and all other points in the same cluster
b: mean distance between sample i and all points in the next nearest cluster
Can be used to select the optimal number of clusters
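A pure-numpy sketch of the mean silhouette coefficient as defined above (the function name is illustrative; clusters are assumed to have at least two members):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient: s_i = (b - a) / max(a, b)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    # Full pairwise distance matrix.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        # a: mean distance to the other points in i's own cluster.
        a = D[i, same & (np.arange(len(X)) != i)].mean()
        # b: smallest mean distance to the points of any other cluster.
        b = min(D[i, labels == c].mean() for c in set(labels) - {labels[i]})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```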
What does the rand index measure?
The similarity between two cluster assignments: the fraction of sample pairs on which the two assignments agree.
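A minimal sketch of the (unadjusted) Rand index as pairwise agreement between two label vectors (the function name is illustrative):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index: fraction of sample pairs on which two clusterings agree
    (both place the pair in the same cluster, or both place it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

Note the index is invariant to label permutation: [0, 0, 1, 1] and [1, 1, 0, 0] describe the same partition and score 1.0.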