Clustering Flashcards
What is a Centroid
The mean vector of all items assigned to a given cluster (i.e. the mean of their feature vectors)
What are partitioning methods (Clustering)
Clustering methods that partition a dataset into a predefined number of clusters
Steps of Lloyd’s Algorithm for k-means clustering
1. Randomly select k initial cluster centroids
2. Assign every item to its nearest centroid
3. Recompute each centroid as the mean of the items assigned to it
4. Repeat from step 2 until no reassignments occur
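A minimal NumPy sketch of these steps (the function name `lloyd_kmeans`, the seeding, and the lack of empty-cluster handling are illustrative assumptions, not part of the original algorithm statement):

```python
import numpy as np

def lloyd_kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly select k initial centroids from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assignments = np.full(len(X), -1)
    while True:
        # 2. Assign every item to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new = dists.argmin(axis=1)
        # 4. Stop once no reassignments occur
        if np.array_equal(new, assignments):
            return centroids, assignments
        assignments = new
        # 3. Recompute each centroid as the mean of its assigned items
        # (empty clusters are not handled in this sketch)
        centroids = np.stack([X[assignments == j].mean(axis=0)
                              for j in range(k)])
```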
What are the limitations of k-Means, in particular Lloyd’s algorithm
- must specify the number of clusters k in advance
- Lloyd’s algorithm is highly sensitive to the choice of initial centroids
- assumes spherically shaped clusters
- unable to handle outliers and noise
Advantages of k-Means
- fast; the heuristic is easy to implement
- “good enough” in a wide variety of tasks and domains
What is a common strategy for tackling cluster initialisation in k-Means
run the algorithm multiple times and select the solution that scores best according to some validation measure
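A hedged sketch of this strategy using scikit-learn, with average silhouette width as the validation measure (the synthetic data and the choice of ten restarts are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Run k-means several times with different random initialisations and
# keep the clustering with the highest average silhouette width
best_score, best_labels = -1.0, None
for seed in range(10):
    labels = KMeans(n_clusters=3, init="random", n_init=1,
                    random_state=seed).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_score, best_labels = score, labels
```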
Explain how the centroids are initialised in k-means++
- select the first centroid uniformly at random
- Repeat k - 1 times:
  - pick a point with probability proportional to the squared distance between it and its nearest existing centroid
  - add the point to the set of centroids
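A minimal NumPy sketch of this initialisation (the function name `kmeans_pp_init` and the RNG handling are illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: chosen uniformly at random from the data
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest existing centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```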
What is the intuition for k-means++ initialisation
The cluster centroids are initialised by favouring points that are farther away from existing centroids, resulting in the centroids being spread out across the dataset.
What is cluster validation
Measures for automatically producing a quantitative evaluation of the quality of a clustering
What is the Silhouette Measure
Validation measure which quantifies the degree to which each item belongs in its assigned cluster, relative to the other clusters
How to calculate the silhouette width
a = average distance to all other items in the same cluster
b = average distance to all items in the nearest competing cluster
silhouette width = (b-a) / max(a, b)
How to compute the average silhouette width (ASW)
Calculate an overall score for a clustering by averaging the silhouette widths of all items
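A direct sketch of both definitions, assuming `X` and `labels` are NumPy arrays with at least two clusters (singleton-cluster edge cases are not handled):

```python
import numpy as np

def silhouette_widths(X, labels):
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    widths = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        # a: average distance to the other items in the same cluster
        a = dists[i, same].sum() / max(same.sum() - 1, 1)
        # b: average distance to items of the nearest competing cluster
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        widths[i] = (b - a) / max(a, b)
    return widths

# ASW: average the silhouette widths over all items, e.g.
# asw = silhouette_widths(X, labels).mean()
```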
Brief description of Agglomerative Hierarchical Clustering
Begin with each item assigned to its own cluster. Apply a bottom-up strategy where, at each step, the most similar pair of clusters are merged
Brief description of Divisive Hierarchical Clustering
Begin with a single cluster containing all items. Apply a top-down strategy where, at each step, a chosen cluster is split into two sub-clusters
AGNES algorithm
Inputs: Distance matrix, Cluster Metric
1. Assign every item to its own cluster (leaf nodes)
2. Find the closest pair of clusters and merge them
3. Compute the distances between the new cluster and each of the remaining clusters
4. Repeat from step 2 until there is one cluster (root node)
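These steps are implemented by SciPy's hierarchical clustering routines; a small sketch (the random data and the cut into three flat clusters are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(20, 2))  # illustrative data

# pdist builds the condensed distance matrix (the AGNES input);
# `method` is the cluster metric: "single", "complete", or "average"
Z = linkage(pdist(X), method="average")

# Cut the tree below the root to recover, e.g., three flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
```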
Define the three Cluster Metrics (Linkage)
Single Linkage: define cluster distance as the smallest pairwise distance between items from each cluster
Complete Linkage: define cluster distance as the largest pairwise distance between items from each cluster
Average Linkage: define cluster distance as the average of all pairwise distances between items from each cluster
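A small sketch of the three metrics as operations on the pairwise distances between two clusters' feature matrices (the helper name `cluster_distance` is illustrative):

```python
from scipy.spatial.distance import cdist

def cluster_distance(A, B, linkage="single"):
    # All pairwise distances between items of cluster A and cluster B
    d = cdist(A, B)
    if linkage == "single":    # smallest pairwise distance
        return d.min()
    if linkage == "complete":  # largest pairwise distance
        return d.max()
    return d.mean()            # average: mean of all pairwise distances
```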
Divisive Algorithms Template
- Start with all items in a single cluster
- Repeat until every item is in its own cluster:
  - choose an existing cluster to split using some splitting criterion
  - replace the chosen cluster with the two resulting sub-clusters
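One common instantiation of this template is bisecting k-means, where the splitting criterion picks the largest cluster and 2-means performs the split. A hedged sketch (the stopping condition here is a target cluster count rather than singleton leaves, and the largest-cluster criterion is just one choice):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_clusters(X, n_clusters):
    # Start with all items in a single cluster (stored as index arrays)
    clusters = [np.arange(len(X))]
    while len(clusters) < n_clusters:
        # Illustrative splitting criterion: pick the largest cluster
        chosen = clusters.pop(max(range(len(clusters)),
                                  key=lambda i: len(clusters[i])))
        # Split the chosen cluster into two sub-clusters with 2-means
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=0).fit_predict(X[chosen])
        clusters += [chosen[labels == 0], chosen[labels == 1]]
    return clusters
```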
Advantages of Hierarchical Clustering
- Allows for multiple levels of granularity, both broad clusters and niche clusters.
- No requirement to select the “correct” value for number of clusters in advance.
Disadvantages of Hierarchical Clustering
- Poor decisions made early in the clustering process can greatly influence the quality of the final clustering.
- Once a merging or splitting decision has been made, there exists no facility to rectify a mistake at a later stage.
- More computationally expensive than partitional methods, particularly for AGNES
What constitutes high quality clusters
high intra-cluster similarity and low inter-cluster similarity
In k-means, the global optimum may be found using what techniques?
- deterministic annealing
- genetic algorithms
What is a Medoid
A point within the cluster whose total dissimilarity to all other points in the cluster is minimal
What is the basic idea of Partitioning Around Medoids (PAM)
Start from an initial set of medoids and iteratively replace a medoid with a non-medoid whenever the swap reduces the total distance of the resulting clustering
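A minimal sketch of this swap loop, assuming a precomputed n x n distance matrix (the function name `pam` and the greedy sweep order are illustrative):

```python
import numpy as np

def pam(dists, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(dists)
    medoids = list(rng.choice(n, size=k, replace=False))  # initial medoids

    def cost(meds):
        # Total distance of each item to its nearest medoid
        return dists[:, meds].min(axis=1).sum()

    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                # Swap medoid i with non-medoid h if it lowers the cost
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                if cost(candidate) < cost(medoids):
                    medoids, improved = candidate, True
    # Cluster labels follow from nearest-medoid assignment:
    # labels = dists[:, medoids].argmin(axis=1)
    return medoids
```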