Clustering Flashcards
What is a Centroid
The mean vector of all items assigned to a given cluster (i.e. the mean of their feature vectors)
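A tiny NumPy illustration (toy vectors, purely for demonstration):

```python
import numpy as np

# Toy example: three 2-D feature vectors assigned to the same cluster
members = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 0.0]])
centroid = members.mean(axis=0)  # component-wise mean -> array([3., 2.])
```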
What are partitioning methods (Clustering)
Clustering methods that partition a dataset into a predefined number of clusters
Steps of Lloyd’s Algorithm for k-means clustering
1. Randomly select k initial cluster centroids
2. Assign every item to its nearest centroid
3. Recompute the centroids (the mean of each cluster's current members)
4. Repeat from step 2 until no reassignments occur (see the sketch below)
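A minimal NumPy sketch of these steps (the function name and interface are illustrative, and the empty-cluster edge case is ignored for brevity):

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct items as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every item to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once no reassignments occur
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels
```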
What are the limitations of k-Means, in particular Lloyd’s algorithm
- must specify the number of clusters, k, in advance
- Lloyd's algorithm is highly sensitive to the choice of initial centroids
- assumes spherically shaped clusters
- handles outliers and noise poorly
Advantages of k-Means
- fast, and the heuristic is easy to implement
- “good enough” in a wide variety of tasks and domains
What is a common strategy for tackling cluster initialisation in k-Means
run the algorithm multiple times and select the solution that scores best according to some validation measure
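As one concrete example, scikit-learn's KMeans automates this via its n_init parameter, although it picks the best run by inertia (within-cluster sum of squared distances) rather than an external validation measure:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data
# 10 restarts from different random initialisations; the lowest-inertia run is kept
km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)
```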
Explain how the centroids are initialised in the k-means++
- select the first centroid uniformly at random
- Repeat k - 1 times:
  - pick a point with probability proportional to its squared distance to the nearest existing centroid
  - add that point to the set of centroids (see the sketch below)
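A short NumPy sketch of this D²-weighted sampling (the function name is illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest existing centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to d^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```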
What is the intuition for k-means++ initialisation
The cluster centroids are initialised by favouring points that are farther away from existing centroids, resulting in the centroids being spread out across the dataset.
What is cluster validation
Measures for automatically producing a quantitative evaluation of the quality of a clustering
What is the Silhouette Measure
Validation measure which quantifies degree to which each item belongs in its assigned cluster, relative to the other clusters
How to calculate the silhouette width
a = average distance from the item to all other items in its own cluster
b = average distance from the item to the items of the nearest competing cluster (the other cluster with the smallest such average)
silhouette width = (b-a) / max(a, b)
How to compute the average silhouette width (ASW)
calculate an overall score for a clustering by averaging the silhouette widths of all items
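A from-scratch sketch of both calculations (O(n²), Euclidean distance, and it assumes every cluster has at least two items; scikit-learn's silhouette_score provides a tested implementation):

```python
import numpy as np

def average_silhouette_width(X, labels):
    labels = np.asarray(labels)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    widths = np.empty(n)
    for i in range(n):
        own = labels[i]
        # a: mean distance to the other items in i's own cluster
        a = D[i, (labels == own) & (np.arange(n) != i)].mean()
        # b: smallest mean distance to the items of any competing cluster
        b = min(D[i, labels == c].mean() for c in set(labels) if c != own)
        widths[i] = (b - a) / max(a, b)  # silhouette width of item i
    return widths.mean()  # ASW: average over all items
```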
Brief description of Agglomerative Hierarchical Clustering
Begin with each item assigned to its own cluster. Apply a bottom-up strategy where, at each step, the most similar pair of clusters are merged
Brief description of Divisive Hierarchical Clustering
Begin with a single cluster containing all items. Apply a top-down strategy where, at each step, a chosen cluster is split into two sub-clusters
AGNES algorithm
Inputs: distance matrix, cluster metric (the linkage criterion, e.g. single, complete, or average)
1. Assign every item to its own cluster (leaf nodes)
2. Find the closest pair of clusters and merge them
3. Compute the distances between the new cluster and each of the remaining clusters
4. Repeat from step 2 until a single cluster remains (the root node); see the sketch below
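SciPy ships an implementation of this bottom-up procedure; a brief sketch on toy data, with "average" linkage as one choice of cluster metric:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # toy data
Z = linkage(X, method="average")                   # bottom-up merge tree
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
```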