Hierarchical Clustering Flashcards
What is Hierarchical clustering?
One of the most commonly used clustering techniques
Produces a hierarchy (dendrogram) of nested clusters that can be analyzed and visualized
What are the approaches to implement hierarchical clustering?
• Agglomerative
– Start with each data point as its own cluster
– At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
• Divisive
– Start with one cluster containing all the data points
– At each step, split a cluster until each cluster contains a single point (or there are k clusters)
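A minimal sketch of the agglomerative approach, assuming scikit-learn is available; the toy data, the choice of average linkage, and k = 2 are illustrative assumptions, not part of the flashcards above.

```python
# Agglomerative clustering sketch: start from individual points, merge until k clusters remain.
# (Library choice, sample data, and parameters are illustrative assumptions.)
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data: two loose groups of 2-D points.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] (label order may differ)
```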
What is the most used approach to build hierarchical clusters and what is the key operation behind it?
Agglomerative hierarchical clustering is the most popular hierarchical clustering technique.
The key operation is the computation of the proximity (distance) between two clusters, and the different linkage methods differ in how they define this distance.
How can we calculate the distance between different clusters?
When performing hierarchical clustering, the distance between clusters is determined by a linkage method. The choice of linkage method affects the structure of the resulting dendrogram and depends on the nature of the data and the clustering objective. Here are the main methods for calculating distances between clusters:
- Single Linkage (Nearest Neighbor)• Definition: The distance between two clusters is defined as the minimum distance between any pair of points in the two clusters.
d(\text{Cluster}_1, \text{Cluster}_2) = \min_{x \in \text{Cluster}_1,\, y \in \text{Cluster}_2} d(x, y)
• Characteristics: • Creates elongated or “chained” clusters. • Sensitive to noise and outliers because they can link distant points. • Use case: When you expect clusters to have an irregular or elongated shape.
- Complete Linkage (Farthest Neighbor)• Definition: The distance between two clusters is defined as the maximum distance between any pair of points in the two clusters.
d(\text{Cluster}_1, \text{Cluster}_2) = \max_{x \in \text{Cluster}_1,\, y \in \text{Cluster}_2} d(x, y)
• Characteristics: • Tends to create compact and spherical clusters. • Less sensitive to noise compared to single linkage. • Use case: When you want compact clusters and can tolerate splitting elongated ones.
- Average Linkage• Definition: The distance between two clusters is the average distance between all pairs of points (one from each cluster).
d(\text{Cluster}_1, \text{Cluster}_2) = \frac{1}{|C_1| \cdot |C_2|} \sum_{x \in \text{Cluster}_1} \sum_{y \in \text{Cluster}_2} d(x, y)
• Characteristics: • Balances the compactness of clusters and their proximity. • Produces moderate cluster shapes. • Use case: General-purpose clustering when you want a balance between compactness and flexibility.
- Centroid Linkage• Definition: The distance between two clusters is the distance between their centroids (mean points).
d(\text{Cluster}_1, \text{Cluster}_2) = d(\text{Centroid}_1, \text{Centroid}_2)
• Characteristics: • Sensitive to outliers because centroids are affected by extreme points. • May produce non-hierarchical structures if centroids shift and cause cluster inversions. • Use case: When clusters are expected to be approximately spherical and of similar size.
- Ward’s Linkage (Minimum Variance Method)• Definition: The distance between two clusters is the increase in variance for the merged cluster. This minimizes the total within-cluster variance.
d(\text{Cluster}_1, \text{Cluster}_2) = \Delta \text{Error Sum of Squares (ESS)}
• Characteristics: • Produces compact and spherical clusters. • Less sensitive to outliers compared to centroid linkage. • Use case: When reducing within-cluster variance is a priority, often used in practice.
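As a quick sketch, the five linkage distances above can be computed directly from their definitions for two fixed clusters. The toy points and the Euclidean metric are assumptions made for illustration; SciPy is only used for the pairwise-distance matrix.

```python
# Five linkage distances between two fixed clusters, computed from their definitions.
import numpy as np
from scipy.spatial.distance import cdist

C1 = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # toy cluster 1
C2 = np.array([[4.0, 4.0], [5.0, 4.0]])               # toy cluster 2

D = cdist(C1, C2)                       # all pairwise point distances, shape (|C1|, |C2|)
single   = D.min()                      # single linkage: closest pair of points
complete = D.max()                      # complete linkage: farthest pair of points
average  = D.mean()                     # average linkage: mean over all pairs

mu1, mu2 = C1.mean(axis=0), C2.mean(axis=0)
centroid = np.linalg.norm(mu1 - mu2)    # centroid linkage: distance between centroids
n1, n2 = len(C1), len(C2)
ward = n1 * n2 / (n1 + n2) * np.sum((mu1 - mu2) ** 2)  # Ward: increase in ESS

print(single, complete, average, centroid, ward)
```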
Key Considerations for Choosing a Method
1. Data Shape: • Irregular/elongated: Use single linkage. • Compact/spherical: Use complete or Ward’s linkage. 2. Sensitivity to Outliers: • More sensitive: Single and centroid linkage. • Less sensitive: Complete and Ward’s linkage. 3. Cluster Sizes: • Similar size: Ward’s or centroid linkage. • Different sizes: Single or average linkage.
Example Workflow:
1. Compute a distance matrix (e.g., Euclidean, Manhattan). 2. Select a linkage method. 3. Merge clusters iteratively based on the selected distance metric. 4. Visualize the clustering hierarchy using a dendrogram.
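A sketch of this workflow with SciPy and matplotlib; the toy data, the Ward linkage choice, and the cut into 3 flat clusters are illustrative assumptions.

```python
# Workflow sketch: distance computation, iterative merging, and dendrogram visualization.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(10, 2))   # toy data

# Steps 1-3: linkage() computes pairwise distances internally and merges
# clusters iteratively using the chosen linkage method.
Z = linkage(X, method="ward", metric="euclidean")

# Step 4: visualize the hierarchy; fcluster() can then cut it into flat clusters.
dendrogram(Z)
plt.show()
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```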
Detail the distance calculation using Ward’s method
Ward’s Method: Distance Calculation in Hierarchical Clustering
Ward’s method, also known as the minimum variance method, is a hierarchical clustering technique that minimizes the increase in the total within-cluster variance (error sum of squares, ESS) when merging two clusters. This method produces compact and spherical clusters and is one of the most commonly used linkage methods.
Key Idea:
When two clusters A and B are merged into a new cluster C , the distance between A and B is determined by the increase in the total within-cluster variance (or ESS).
Formula for Ward’s Distance:
The Ward’s distance d(A, B) between two clusters A and B is given by:
d(A, B) = \frac{|A||B|}{|A| + |B|} | \mu_A - \mu_B |^2
Where:
• |A| and |B| : Number of data points in clusters A and B , respectively.
• \mu_A and \mu_B : Centroids (mean vectors) of clusters A and B , respectively.
• | \mu_A - \mu_B |^2 : Squared Euclidean distance between the centroids of clusters A and B .
Steps to Calculate Ward’s Distance:
1. Compute the Centroids: • For cluster A , calculate:
\mu_A = \frac{1}{|A|} \sum_{x \in A} x
• Similarly, for cluster B , calculate:
\mu_B = \frac{1}{|B|} \sum_{x \in B} x
2. Calculate the Squared Euclidean Distance Between Centroids:
| \mu_A - \mu_B |^2 = \sum_{i=1}^{d} (\mu_{A,i} - \mu_{B,i})^2
Here, d is the number of dimensions in the data.
3. Scale by Cluster Sizes:
• Multiply by the product of the cluster sizes ( |A||B| ).
• Divide by the total size of the merged cluster ( |A| + |B| ).
4. Result:
The resulting distance represents the increase in within-cluster variance if clusters A and B are merged.
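A small NumPy sketch of these three steps, assuming toy cluster arrays; the sample points are chosen so that their centroids match the worked example below, so the printed value should come out to 15.6.

```python
# Ward's distance between two clusters: increase in within-cluster variance (ESS) if merged.
import numpy as np

def wards_distance(A, B):
    """Ward's distance d(A, B) = |A||B| / (|A| + |B|) * ||mu_A - mu_B||^2."""
    mu_A, mu_B = A.mean(axis=0), B.mean(axis=0)            # step 1: centroids
    sq_dist = np.sum((mu_A - mu_B) ** 2)                   # step 2: squared Euclidean distance
    return len(A) * len(B) / (len(A) + len(B)) * sq_dist   # step 3: scale by cluster sizes

A = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])  # centroid (2, 3)
B = np.array([[3.0, 5.0], [5.0, 7.0]])              # centroid (4, 6)
print(wards_distance(A, B))  # -> 15.6 (up to floating-point rounding)
```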
Why Use Ward’s Method?
Ward’s method explicitly minimizes the total within-cluster variance at each step. This ensures:
• Clusters are compact (low variance).
• The clustering process is stable, avoiding early chaining effects seen in methods like single linkage.
Example:
Given:
• Cluster A contains |A| = 3 points with centroid \mu_A = (2, 3) . • Cluster B contains |B| = 2 points with centroid \mu_B = (4, 6) .
Step 1: Compute | \mu_A - \mu_B |^2 :
| \mu_A - \mu_B |^2 = (2 - 4)^2 + (3 - 6)^2 = 4 + 9 = 13
Step 2: Compute d(A, B) :
d(A, B) = \frac{|A||B|}{|A| + |B|} | \mu_A - \mu_B |^2
Substitute values:
d(A, B) = \frac{3 \cdot 2}{3 + 2} \cdot 13 = \frac{6}{5} \cdot 13 = 15.6
Thus, the Ward’s distance between clusters A and B is 15.6.
Summary of Ward’s Method:
• Focus: Minimizing the increase in within-cluster variance at each merge. • Key Calculation: Combines cluster sizes and centroid distances to balance compactness and representativeness. • Use Case: Works well for compact, spherical clusters and is less sensitive to outliers than centroid linkage.
Detail the distance calculation using the Lance-Williams method
Lance-Williams Method: Distance Calculation in Hierarchical Clustering
The Lance-Williams formula is a general and flexible framework used to compute distances between clusters during hierarchical clustering. It supports a variety of linkage methods (e.g., single, complete, average, Ward’s) by defining a unified formula that updates the distance matrix as clusters are merged.
Formula for the Lance-Williams Update:
When two clusters A and B are merged into a new cluster C , the distance between C and any other cluster D is given by:
d(C, D) = \alpha_A d(A, D) + \alpha_B d(B, D) + \beta d(A, B) + \gamma |d(A, D) - d(B, D)|
Where:
• d(A, D) : Distance between cluster A and cluster D .
• d(B, D) : Distance between cluster B and cluster D .
• d(A, B) : Distance between the merged clusters A and B .
• \alpha_A and \alpha_B : Weighting coefficients for d(A, D) and d(B, D) .
• \beta : Weighting factor for d(A, B) .
• \gamma : Weighting factor for the absolute difference |d(A, D) - d(B, D)| .
The values of \alpha_A , \alpha_B , \beta , and \gamma depend on the linkage method being used.
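A minimal sketch of the update itself, with the coefficients passed in as plain arguments; the helper name and the sample numbers are assumptions for illustration.

```python
# General Lance-Williams update: distance from the merged cluster C = A ∪ B to another cluster D.
def lance_williams(d_AD, d_BD, d_AB, alpha_A, alpha_B, beta, gamma):
    return alpha_A * d_AD + alpha_B * d_BD + beta * d_AB + gamma * abs(d_AD - d_BD)

# With the single-linkage coefficients (1/2, 1/2, 0, -1/2) the update recovers min(d(A,D), d(B,D)):
print(lance_williams(2.0, 5.0, 1.0, 0.5, 0.5, 0.0, -0.5))  # -> 2.0
```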
Special Cases of Linkage Methods Using Lance-Williams:
- Single Linkage (Nearest Neighbor):• \alpha_A = \frac{1}{2}, \, \alpha_B = \frac{1}{2}, \, \beta = 0, \, \gamma = -\frac{1}{2}
d(C, D) = \frac{1}{2} d(A, D) + \frac{1}{2} d(B, D) - \frac{1}{2} |d(A, D) - d(B, D)|
• Distance is based on the minimum distance between any two points from the two clusters.
- Complete Linkage (Farthest Neighbor):• \alpha_A = \frac{1}{2}, \, \alpha_B = \frac{1}{2}, \, \beta = 0, \, \gamma = \frac{1}{2}
d(C, D) = \frac{1}{2} d(A, D) + \frac{1}{2} d(B, D) + \frac{1}{2} |d(A, D) - d(B, D)|
• Distance is based on the maximum distance between any two points from the two clusters.
- Average Linkage:• \alpha_A = \frac{|A|}{|A| + |B|}, \, \alpha_B = \frac{|B|}{|A| + |B|}, \, \beta = 0, \, \gamma = 0
d(C, D) = \frac{|A|}{|A| + |B|} d(A, D) + \frac{|B|}{|A| + |B|} d(B, D)
• Distance is the weighted average of the distances from D to A and B , based on the cluster sizes.
- Centroid Linkage:• \alpha_A = \frac{|A|}{|A| + |B|}, \, \alpha_B = \frac{|B|}{|A| + |B|}, \, \beta = -\frac{|A| \cdot |B|}{(|A| + |B|)^2}, \, \gamma = 0
d(C, D) = \frac{|A|}{|A| + |B|} d(A, D) + \frac{|B|}{|A| + |B|} d(B, D) - \frac{|A| \cdot |B|}{(|A| + |B|)^2} d(A, B)
• Distance is based on the centroids of the clusters and their relative sizes.
- Ward’s Method:• \alpha_A = \frac{|A| + |D|}{|A| + |B| + |D|}, \, \alpha_B = \frac{|B| + |D|}{|A| + |B| + |D|}, \, \beta = -\frac{|D|}{|A| + |B| + |D|}, \, \gamma = 0
d(C, D) = \frac{|A| + |D|}{|A| + |B| + |D|} d(A, D) + \frac{|B| + |D|}{|A| + |B| + |D|} d(B, D) - \frac{|D|}{|A| + |B| + |D|} d(A, B)
• Distance is determined by minimizing the total within-cluster variance.
Key Features of Lance-Williams Method:
1. Flexibility: By adjusting the coefficients ( \alpha_A, \alpha_B, \beta, \gamma ), the method can implement different linkage strategies. 2. Efficiency: Allows dynamic updating of the distance matrix without recalculating pairwise distances, reducing computational cost. 3. Unified Framework: All popular hierarchical clustering methods can be expressed using this formula.
Workflow Using Lance-Williams:
1. Start with a distance matrix for individual points. 2. Select a linkage method (e.g., single, complete, Ward’s). 3. Use the Lance-Williams formula to compute distances when merging clusters. 4. Repeat until all points are merged into a single cluster.
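A naive end-to-end sketch of this workflow, keeping a pairwise distance matrix and applying the single-linkage Lance-Williams update after each merge; the toy data and variable names are illustrative, and a real implementation (e.g., SciPy's linkage) is far more efficient.

```python
# Naive agglomerative clustering driven by the Lance-Williams update (single linkage).
import numpy as np
from itertools import combinations

X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 4.0]])
clusters = {i: [i] for i in range(len(X))}              # cluster id -> member point indices

def pair(i, j):
    return (i, j) if i < j else (j, i)                  # canonical key for the distance matrix

dist = {pair(i, j): float(np.linalg.norm(X[i] - X[j]))
        for i, j in combinations(range(len(X)), 2)}

def d(i, j):
    return dist[pair(i, j)]

while len(clusters) > 1:
    a, b = min(combinations(clusters, 2), key=lambda p: d(*p))  # closest pair of clusters
    print(f"merge {clusters[a]} + {clusters[b]} at distance {d(a, b):.3f}")
    for k in clusters:
        if k not in (a, b):  # Lance-Williams update with single-linkage coefficients
            dist[pair(a, k)] = (0.5 * d(a, k) + 0.5 * d(b, k)
                                - 0.5 * abs(d(a, k) - d(b, k)))
    clusters[a] += clusters.pop(b)
```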
The Lance-Williams method streamlines hierarchical clustering by offering a single, adaptable framework for calculating inter-cluster distances.