Clustering Flashcards
What is Hierarchical clustering?
One of the most commonly used clustering techniques
Produces a hierarchy (dendrogram) of nested clusters that can be analyzed and visualized
What are the approaches to implement hierarchical clustering?
• Agglomerative
– Start with each data point as its own cluster
– At each step, merge the closest pair of clusters until only one cluster (or k clusters) remains
• Divisive
– Start with one cluster containing all the data points
– At each step, split a cluster until each cluster contains a single point (or there are k clusters)
What is the most used approach to build hierarchical clusters and what is the key operation behind it?
Agglomerative hierarchical clustering is the most popular hierarchical clustering technique
Key operation is the computation of the proximity of two clusters
Different approaches to defining the distance between clusters
How can we calculate the distance between different clusters?
When performing hierarchical clustering, the distance between clusters is determined by a linkage method. The choice of linkage method affects the structure of the resulting dendrogram and depends on the nature of the data and the clustering objective. Here are the main methods for calculating distances between clusters:
- Single Linkage (Nearest Neighbor)
• Definition: The distance between two clusters is defined as the minimum distance between any pair of points in the two clusters.
d(\text{Cluster}_1, \text{Cluster}_2) = \min_{x \in \text{Cluster}_1,\, y \in \text{Cluster}_2} d(x, y)
• Characteristics:
  • Creates elongated or “chained” clusters.
  • Sensitive to noise and outliers because they can link distant points.
• Use case: When you expect clusters to have an irregular or elongated shape.
- Complete Linkage (Farthest Neighbor)
• Definition: The distance between two clusters is defined as the maximum distance between any pair of points in the two clusters.
d(\text{Cluster}_1, \text{Cluster}_2) = \max_{x \in \text{Cluster}_1,\, y \in \text{Cluster}_2} d(x, y)
• Characteristics:
  • Tends to create compact and spherical clusters.
  • Less sensitive to noise compared to single linkage.
• Use case: When you want compact clusters and can tolerate splitting elongated ones.
- Average Linkage
• Definition: The distance between two clusters is the average distance between all pairs of points (one from each cluster).
d(\text{Cluster}_1, \text{Cluster}_2) = \frac{1}{|C_1| \cdot |C_2|} \sum_{x \in \text{Cluster}_1} \sum_{y \in \text{Cluster}_2} d(x, y)
• Characteristics:
  • Balances the compactness of clusters and their proximity.
  • Produces moderate cluster shapes.
• Use case: General-purpose clustering when you want a balance between compactness and flexibility.
- Centroid Linkage
• Definition: The distance between two clusters is the distance between their centroids (mean points).
d(\text{Cluster}_1, \text{Cluster}_2) = d(\text{Centroid}_1, \text{Centroid}_2)
• Characteristics:
  • Sensitive to outliers because centroids are affected by extreme points.
  • May produce non-hierarchical structures if centroids shift and cause cluster inversions.
• Use case: When clusters are expected to be approximately spherical and of similar size.
- Ward’s Linkage (Minimum Variance Method)
• Definition: The distance between two clusters is the increase in variance for the merged cluster; this minimizes the total within-cluster variance.
d(\text{Cluster}_1, \text{Cluster}_2) = \Delta \text{Error Sum of Squares (ESS)}
• Characteristics:
  • Produces compact and spherical clusters.
  • Less sensitive to outliers compared to centroid linkage.
• Use case: When reducing within-cluster variance is a priority; often used in practice.
Key Considerations for Choosing a Method
1. Data Shape:
  • Irregular/elongated: Use single linkage.
  • Compact/spherical: Use complete or Ward’s linkage.
2. Sensitivity to Outliers:
  • More sensitive: Single and centroid linkage.
  • Less sensitive: Complete and Ward’s linkage.
3. Cluster Sizes:
  • Similar size: Ward’s or centroid linkage.
  • Different sizes: Single or average linkage.
Example Workflow:
1. Compute a distance matrix (e.g., Euclidean, Manhattan).
2. Select a linkage method.
3. Merge clusters iteratively based on the selected distance metric.
4. Visualize the clustering hierarchy using a dendrogram.
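A minimal sketch of this workflow with SciPy, assuming Euclidean distances; the synthetic 2-D data and the choice of Ward linkage are illustrative only:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Illustrative 2-D data: two loose groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# 1. Compute the distance matrix (condensed form); 'cityblock' would give Manhattan.
D = pdist(X, metric="euclidean")

# 2-3. Select a linkage method and merge clusters iteratively.
Z = linkage(D, method="ward")  # also try 'single', 'complete', 'average'

# 4. Visualize the clustering hierarchy using a dendrogram.
dendrogram(Z)
plt.xlabel("data points")
plt.ylabel("merge distance")
plt.show()
```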
Detail the distance calculation using the Ward’s method
Ward’s Method: Distance Calculation in Hierarchical Clustering
Ward’s method, also known as the minimum variance method, is a hierarchical clustering technique that minimizes the increase in the total within-cluster variance (error sum of squares, ESS) when merging two clusters. This method produces compact and spherical clusters and is one of the most commonly used linkage methods.
Key Idea:
When two clusters A and B are merged into a new cluster C , the distance between A and B is determined by the increase in the total within-cluster variance (or ESS).
Formula for Ward’s Distance:
The Ward’s distance d(A, B) between two clusters A and B is given by:
d(A, B) = \frac{|A||B|}{|A| + |B|} \|\mu_A - \mu_B\|^2
Where:
• |A| and |B| : Number of data points in clusters A and B , respectively.
• \mu_A and \mu_B : Centroids (mean vectors) of clusters A and B , respectively.
• \|\mu_A - \mu_B\|^2 : Squared Euclidean distance between the centroids of clusters A and B .
Steps to Calculate Ward’s Distance:
1. Compute the Centroids:
• For cluster A , calculate:
\mu_A = \frac{1}{|A|} \sum_{x \in A} x
• Similarly, for cluster B , calculate:
\mu_B = \frac{1}{|B|} \sum_{x \in B} x
2. Calculate the Squared Euclidean Distance Between Centroids:
\|\mu_A - \mu_B\|^2 = \sum_{i=1}^{d} (\mu_{A,i} - \mu_{B,i})^2
Here, d is the number of dimensions in the data.
3. Scale by Cluster Sizes:
• Multiply by the product of the cluster sizes ( |A||B| ).
• Divide by the total size of the merged cluster ( |A| + |B| ).
4. Result:
The resulting distance represents the increase in within-cluster variance if clusters A and B are merged.
Why Use Ward’s Method?
Ward’s method explicitly minimizes the total within-cluster variance at each step. This ensures:
• Clusters are compact (low variance).
• The clustering process is stable, avoiding early chaining effects seen in methods like single linkage.
Example:
Given:
• Cluster A contains |A| = 3 points with centroid \mu_A = (2, 3) .
• Cluster B contains |B| = 2 points with centroid \mu_B = (4, 6) .
Step 1: Compute \|\mu_A - \mu_B\|^2 :
\|\mu_A - \mu_B\|^2 = (2 - 4)^2 + (3 - 6)^2 = 4 + 9 = 13
Step 2: Compute d(A, B) :
d(A, B) = \frac{|A||B|}{|A| + |B|} \|\mu_A - \mu_B\|^2
Substitute values:
d(A, B) = \frac{3 \cdot 2}{3 + 2} \cdot 13 = \frac{6}{5} \cdot 13 = 15.6
Thus, the Ward’s distance between clusters A and B is 15.6.
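A short Python check of this worked example; the function name is illustrative, and it simply evaluates the Ward formula above from the cluster sizes and centroids:

```python
import numpy as np

def ward_distance(mu_a, mu_b, n_a, n_b):
    """Ward's distance: d(A, B) = |A||B| / (|A| + |B|) * ||mu_A - mu_B||^2."""
    mu_a, mu_b = np.asarray(mu_a, float), np.asarray(mu_b, float)
    return (n_a * n_b) / (n_a + n_b) * float(np.sum((mu_a - mu_b) ** 2))

# The example above: |A| = 3, mu_A = (2, 3); |B| = 2, mu_B = (4, 6)
print(ward_distance((2, 3), (4, 6), 3, 2))  # 15.6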
Summary of Ward’s Method:
• Focus: Minimizing the increase in within-cluster variance at each merge.
• Key Calculation: Combines cluster sizes and centroid distances to balance compactness and representativeness.
• Use Case: Works well for compact, spherical clusters and is less sensitive to outliers than centroid linkage.
Detail the distance calculation using the Lance-Williams method
Lance-Williams Method: Distance Calculation in Hierarchical Clustering
The Lance-Williams formula is a general and flexible framework used to compute distances between clusters during hierarchical clustering. It supports a variety of linkage methods (e.g., single, complete, average, Ward’s) by defining a unified formula that updates the distance matrix as clusters are merged.
Formula for the Lance-Williams Update:
When two clusters A and B are merged into a new cluster C , the distance between C and any other cluster D is given by:
d(C, D) = \alpha_A d(A, D) + \alpha_B d(B, D) + \beta d(A, B) + \gamma |d(A, D) - d(B, D)|
Where:
• d(A, D) : Distance between cluster A and cluster D .
• d(B, D) : Distance between cluster B and cluster D .
• d(A, B) : Distance between the merged clusters A and B .
• \alpha_A and \alpha_B : Weighting coefficients for d(A, D) and d(B, D) .
• \beta : Weighting factor for d(A, B) .
• \gamma : Weighting factor for the absolute difference |d(A, D) - d(B, D)| .
The values of \alpha_A , \alpha_B , \beta , and \gamma depend on the linkage method being used.
Special Cases of Linkage Methods Using Lance-Williams:
- Single Linkage (Nearest Neighbor):
• \alpha_A = \frac{1}{2}, \, \alpha_B = \frac{1}{2}, \, \beta = 0, \, \gamma = -\frac{1}{2}
d(C, D) = \frac{1}{2} d(A, D) + \frac{1}{2} d(B, D) - \frac{1}{2} |d(A, D) - d(B, D)|
• Distance is based on the minimum distance between any two points from the two clusters.
- Complete Linkage (Farthest Neighbor):
• \alpha_A = \frac{1}{2}, \, \alpha_B = \frac{1}{2}, \, \beta = 0, \, \gamma = \frac{1}{2}
d(C, D) = \frac{1}{2} d(A, D) + \frac{1}{2} d(B, D) + \frac{1}{2} |d(A, D) - d(B, D)|
• Distance is based on the maximum distance between any two points from the two clusters.
- Average Linkage:
• \alpha_A = \frac{|A|}{|A| + |B|}, \, \alpha_B = \frac{|B|}{|A| + |B|}, \, \beta = 0, \, \gamma = 0
d(C, D) = \frac{|A|}{|A| + |B|} d(A, D) + \frac{|B|}{|A| + |B|} d(B, D)
• Distance is the weighted average of the distances from D to A and B , based on the cluster sizes.
- Centroid Linkage:
• \alpha_A = \frac{|A|}{|A| + |B|}, \, \alpha_B = \frac{|B|}{|A| + |B|}, \, \beta = -\frac{|A| \cdot |B|}{(|A| + |B|)^2}, \, \gamma = 0
d(C, D) = \frac{|A|}{|A| + |B|} d(A, D) + \frac{|B|}{|A| + |B|} d(B, D) - \frac{|A| \cdot |B|}{(|A| + |B|)^2} d(A, B)
• Distance is based on the centroids of the clusters and their relative sizes.
- Ward’s Method:
• \alpha_A = \frac{|A| + |D|}{|A| + |B| + |D|}, \, \alpha_B = \frac{|B| + |D|}{|A| + |B| + |D|}, \, \beta = -\frac{|D|}{|A| + |B| + |D|}, \, \gamma = 0
d(C, D) = \frac{|A| + |D|}{|A| + |B| + |D|} d(A, D) + \frac{|B| + |D|}{|A| + |B| + |D|} d(B, D) - \frac{|D|}{|A| + |B| + |D|} d(A, B)
• Distance is determined by minimizing the total within-cluster variance.
Key Features of Lance-Williams Method:
1. Flexibility: By adjusting the coefficients ( \alpha_A, \alpha_B, \beta, \gamma ), the method can implement different linkage strategies.
2. Efficiency: Allows dynamic updating of the distance matrix without recalculating pairwise distances, reducing computational cost.
3. Unified Framework: All popular hierarchical clustering methods can be expressed using this formula.
Workflow Using Lance-Williams:
1. Start with a distance matrix for individual points.
2. Select a linkage method (e.g., single, complete, Ward’s).
3. Use the Lance-Williams formula to compute distances when merging clusters.
4. Repeat until all points are merged into a single cluster.
The Lance-Williams method streamlines hierarchical clustering by offering a single, adaptable framework for calculating inter-cluster distances.
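A minimal Python sketch of the Lance-Williams update, with the coefficient sets listed above selected by a `method` argument; the function name and the test values are illustrative only:

```python
def lance_williams_update(d_ad, d_bd, d_ab, n_a, n_b, n_d, method="single"):
    """Distance between the merged cluster C = A ∪ B and another cluster D,
    using the Lance-Williams recurrence with the coefficients listed above.
    Note: for 'centroid' and 'ward' the recurrence is exact when the stored
    distances are squared Euclidean distances."""
    if method == "single":
        a_a, a_b, beta, gamma = 0.5, 0.5, 0.0, -0.5
    elif method == "complete":
        a_a, a_b, beta, gamma = 0.5, 0.5, 0.0, 0.5
    elif method == "average":
        a_a, a_b, beta, gamma = n_a / (n_a + n_b), n_b / (n_a + n_b), 0.0, 0.0
    elif method == "centroid":
        a_a, a_b = n_a / (n_a + n_b), n_b / (n_a + n_b)
        beta, gamma = -(n_a * n_b) / (n_a + n_b) ** 2, 0.0
    elif method == "ward":
        total = n_a + n_b + n_d
        a_a, a_b = (n_a + n_d) / total, (n_b + n_d) / total
        beta, gamma = -n_d / total, 0.0
    else:
        raise ValueError(f"unknown linkage method: {method}")
    return a_a * d_ad + a_b * d_bd + beta * d_ab + gamma * abs(d_ad - d_bd)

# Single linkage reduces to min(d(A,D), d(B,D)); complete linkage to the max:
print(lance_williams_update(2.0, 5.0, 3.0, 1, 1, 1, "single"))    # 2.0
print(lance_williams_update(2.0, 5.0, 3.0, 1, 1, 1, "complete"))  # 5.0
```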
Refer to image DM2. What are the results of merging the closest clusters?
Answer in the same image.
What is the complexity of implementing an ordinary hierarchical clustering?
• O(N^2) space, since it stores the proximity matrix (N is the number of points)
• There are N steps, and at each step the proximity matrix (of size N^2) must be updated and searched, giving O(N^3) time in many cases
• Complexity can be reduced to O(N^2 log N) time for some approaches
How could we implement the construction of hierarchical clustering in a more efficient way?
• Compute the distance between all pairs of points [O(N^2)]
• Insert the pairs and their distances into a priority queue so the minimum can be found in one step [O(N^2)]
• When two clusters are merged, remove all entries in the priority queue involving either of these two clusters [O(N log N)]
• Compute all the distances between the new cluster and the remaining clusters and insert them into the queue [O(N log N)]
• Since the last two steps are executed at most N times, the overall complexity is O(N^2 log N)
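A minimal Python sketch of this priority-queue scheme, assuming Euclidean distances and single linkage; for simplicity it skips stale heap entries lazily when they are popped instead of explicitly removing them, a common practical variant of the scheme above:

```python
import heapq
import itertools
import numpy as np

def agglomerative_single_linkage(points, k=1):
    """Heap-based agglomerative clustering (single linkage, Euclidean distance).
    Returns k clusters as lists of point indices."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    clusters = {i: [i] for i in range(n)}   # cluster id -> member point indices
    active = set(range(n))
    next_id = itertools.count(n)

    # Steps 1-2: all pairwise distances, kept in a dict and a priority queue.
    dist, heap = {}, []
    for i in range(n):
        for j in range(i + 1, n):
            d = float(np.linalg.norm(points[i] - points[j]))
            dist[(i, j)] = d
            heap.append((d, i, j))
    heapq.heapify(heap)

    def get(a, b):
        return dist[(min(a, b), max(a, b))]

    while len(active) > k and heap:
        d, a, b = heapq.heappop(heap)
        if a not in active or b not in active:
            continue                         # stale entry for an already-merged cluster
        c = next(next_id)                    # merge a and b into a new cluster c
        clusters[c] = clusters.pop(a) + clusters.pop(b)
        active -= {a, b}
        for o in active:
            # Single-linkage update: d(C, O) = min(d(A, O), d(B, O))
            d_co = min(get(a, o), get(b, o))
            dist[(min(c, o), max(c, o))] = d_co
            heapq.heappush(heap, (d_co, min(c, o), max(c, o)))
        active.add(c)
    return [clusters[c] for c in active]

# Example: two obvious groups on a line.
print(agglomerative_single_linkage([[0.0], [0.1], [0.2], [5.0], [5.1]], k=2))
```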
Which partition generated by hierarchical clustering should we choose?
Ideally, a good clustering should partition data points so that …
Data points in the same cluster should have a small distance from one another
Data points in different clusters should be at a large distance from one another.
How can we evaluate the quality of a clustering solution?
• Internal Validation Measures
– Employ criteria that are derived from the data itself
– For instance, intracluster and intercluster distances to measure cluster cohesion and separation
– Cohesion evaluates how similar are the points in the same cluster
– Separation, how far apart are the points in different clusters
• External Validation Measures
– Use prior or expert-specified knowledge about the clusters
– For example, we cluster the Iris data using the four input variables, then we evaluate the clustering using known class labels
– Employs criteria that are not inherent to the dataset
• Relative Validation Measures
– Aim to directly compare different solutions, usually those obtained via different parameter settings for the same algorithm.
Detail how to calculate the internal measures of quality of a cluster called cohesion and separation
Internal Measures of Cluster Quality: Cohesion and Separation
Cohesion and separation are commonly used internal cluster validation metrics that measure the quality of clusters based on the compactness (cohesion) and distinctiveness (separation) of clusters. These metrics do not require ground truth labels and are computed directly from the dataset.
- Cohesion: Intra-Cluster Compactness
Definition:
Cohesion measures how closely the points within a cluster are related to each other. It reflects the compactness of a cluster: when cohesion is measured as a sum of within-cluster distances, lower values indicate a better-formed (more compact) cluster.
Formula:
For a cluster C_i , cohesion can be calculated as the sum of the pairwise distances between all points in the cluster or the sum of distances to the centroid:
1. Sum of Pairwise Distances:
\text{Cohesion}(C_i) = \sum_{x, y \in C_i} d(x, y)
Where:
• x, y : Data points in cluster C_i .
• d(x, y) : Distance between points x and y (e.g., Euclidean distance).
2. Sum of Distances to Centroid:
\text{Cohesion}(C_i) = \sum_{x \in C_i} d(x, \mu_i)
Where:
• \mu_i : Centroid (mean) of cluster C_i .
• d(x, \mu_i) : Distance between point x and the cluster centroid.
Interpretation:
• Lower cohesion values indicate a more compact cluster.
• Used to evaluate if points within clusters are close to each other.
- Separation: Inter-Cluster Distinctiveness
Definition:
Separation measures how distinct or far apart different clusters are from each other. It reflects the distance between clusters: clusters with high separation are well-separated.
Formula:
For clusters C_i and C_j , separation can be calculated as:
1. Centroid-Based Distance:
\text{Separation}(C_i, C_j) = d(\mu_i, \mu_j)
Where:
• \mu_i , \mu_j : Centroids of clusters C_i and C_j .
• d(\mu_i, \mu_j) : Distance between the centroids (e.g., Euclidean distance).
2. Minimum Distance (Single Linkage):
\text{Separation}(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)
3. Maximum Distance (Complete Linkage):
\text{Separation}(C_i, C_j) = \max_{x \in C_i, y \in C_j} d(x, y)
4. Average Distance:
\text{Separation}(C_i, C_j) = \frac{1}{|C_i| |C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)
Overall Separation:
If there are multiple clusters, separation can be averaged across all cluster pairs:
\text{Overall Separation} = \frac{1}{\binom{k}{2}} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \text{Separation}(C_i, C_j)
Where k is the total number of clusters.
Interpretation:
• Higher separation values indicate well-separated clusters.
• Used to evaluate if clusters are sufficiently distinct from one another.
Combining Cohesion and Separation
A good clustering solution exhibits low cohesion (compact clusters) and high separation (distinct clusters). These metrics can be combined into a single measure, such as the Silhouette Coefficient:
s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
Where:
• a(i) : Cohesion of a point i , the average distance to other points in the same cluster.
• b(i) : Separation of a point i , the average distance to points in the nearest neighboring cluster.
The Silhouette Coefficient ranges from -1 to 1, where:
• Values close to 1: Well-clustered data.
• Values close to 0: Overlapping clusters.
• Values close to -1: Poorly clustered data.
Practical Use
1. Cohesion:
  • Use the centroid-based calculation for large datasets for efficiency.
  • Use pairwise distance calculations for smaller datasets or when fine-grained analysis is needed.
2. Separation:
  • Centroid-based calculations are sufficient for most clustering applications.
  • Minimum or maximum distance is useful for linkage-specific validations.
These measures guide the choice of clustering algorithms and validate results to ensure that clusters are meaningful and well-separated.
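A minimal Python sketch of centroid-based cohesion and separation, plus scikit-learn’s silhouette score; the data and labels are illustrative, and in practice the labels would come from cutting a dendrogram or another clustering run:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def cohesion(X, labels):
    """Sum of distances from each point to its cluster centroid (lower = more compact)."""
    return sum(np.linalg.norm(X[labels == c] - X[labels == c].mean(axis=0), axis=1).sum()
               for c in np.unique(labels))

def separation(X, labels):
    """Average centroid-to-centroid distance over all cluster pairs (higher = more distinct)."""
    cents = np.array([X[labels == c].mean(axis=0) for c in np.unique(labels)])
    k = len(cents)
    return float(np.mean([np.linalg.norm(cents[i] - cents[j])
                          for i in range(k) for j in range(i + 1, k)]))

# Illustrative data and labels: two well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)

print("cohesion   :", cohesion(X, labels))
print("separation :", separation(X, labels))
print("silhouette :", silhouette_score(X, labels))
```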
What is the knee/elbow analysis? How can we use it?
The knee/elbow method is a heuristic technique used to determine the optimal number of clusters ( k ) in clustering algorithms
Evaluation of Hierarchical Clustering using Knee/Elbow Analysis
Plot the cohesion and separation for every candidate clustering (i.e., for each number of clusters)
Look for a knee/elbow in the plot, i.e., a point where the evaluation metric changes sharply and then levels off
In practice, examine several candidate solutions around the knee and check how coherent they are with one another
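A minimal sketch of a knee/elbow plot for hierarchical clustering, assuming SciPy and matplotlib; it cuts a Ward dendrogram at k = 1..9 and plots the within-cluster sum of squares (a cohesion measure) against k, using synthetic data for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: three compact groups.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.5, (25, 2)) for c in [(0, 0), (4, 0), (2, 4)]])
Z = linkage(X, method="ward")

def wcss(X, labels):
    """Within-cluster sum of squares (a cohesion measure)."""
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

ks = list(range(1, 10))
scores = [wcss(X, fcluster(Z, t=k, criterion="maxclust")) for k in ks]

plt.plot(ks, scores, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS (cohesion)")
plt.show()  # the bend in the curve (here around k = 3) is the knee/elbow
```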
What is a way to improve the knee/elbow analysis?
We can scan the different configurations (values of k) proposed by the analysis and count/evaluate the size of the resulting clusters
Analyzing cluster size while performing the knee/elbow analysis is a good practice because it helps assess the distribution of data points among clusters, ensuring that the chosen number of clusters ( k ) produces a meaningful and interpretable partitioning of the data. Here’s why this is important:
- Avoids Small or Unbalanced Clusters
• When k increases, it can lead to small or unbalanced clusters, where some clusters have very few data points.
• Such clusters may not represent meaningful patterns in the data and can result from overfitting the clustering process.
Why it matters:
• Clustering aims to group data meaningfully. Large discrepancies in cluster sizes may indicate that the clustering result is dominated by noise or that k is too large.
- Prevents Over-Segmentation
• Increasing k often reduces WCSS (the within-cluster sum of squares) because more clusters better approximate the data distribution.
• However, excessive k can lead to over-segmentation, splitting meaningful clusters into smaller, redundant groups.
Why it matters:
• Analyzing cluster sizes ensures that clusters remain large enough to capture real structure in the data rather than splitting data arbitrarily to minimize WCSS.
- Enhances Interpretability
• Clusters that are too small or too numerous are harder to interpret and may not align with the goals of the analysis.
• By analyzing cluster sizes, you can identify whether the chosen k produces clusters that are practically useful for your application.
Why it matters:
• For instance, in customer segmentation, having clusters with only a handful of customers might not provide actionable insights.
- Identifies Potential Noise or Outliers
• Very small clusters can indicate the presence of noise or outliers in the data.
• Observing cluster sizes during the elbow analysis can help identify these anomalies and guide decisions on whether to preprocess the data further.
Why it matters:
• Removing or separately analyzing outliers can lead to more robust clustering results.
- Supports Better Validation
• Internal validation measures like cohesion and separation, or external validation metrics, can be misleading if some clusters have disproportionately small or large sizes.
• Analyzing cluster sizes provides additional context to validate whether the chosen k produces a balanced and coherent solution.
- Helps Fine-Tune the Selection of k
• The elbow point may not always correspond to the “best” k in terms of application-specific needs.
• By considering cluster sizes, you can refine the choice of k to strike a balance between compact clusters (low WCSS) and practical utility (balanced sizes).
Example of Why Cluster Size Matters:
Scenario:
Suppose you are performing k-means clustering to segment customers.
1. Without Analyzing Cluster Sizes:
• Elbow analysis suggests k = 6 .
• However, analyzing the cluster sizes reveals that one cluster has 95% of the data, while the other five clusters are tiny, containing only a few customers each.
2. With Analyzing Cluster Sizes:
• Adjust k to a value where the clusters are more balanced (e.g., k = 3 ), providing more meaningful groupings of customers.
Best Practices for Analyzing Cluster Sizes:
1. Plot Cluster Sizes:
  • Visualize cluster sizes for different k (e.g., bar chart or histogram).
  • Look for balanced distributions or detect extreme outliers.
2. Evaluate Distribution:
  • Avoid clusters that are disproportionately large or small unless justified by the data’s structure.
3. Combine with Other Metrics:
  • Use cohesion, separation, or the Silhouette Score to assess whether clusters are compact and well-separated.
4. Domain Knowledge:
  • Leverage domain expertise to determine whether the cluster sizes align with expectations and practical goals.
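A minimal sketch of this cluster-size check, assuming SciPy; it cuts a Ward dendrogram at several values of k and prints the resulting cluster sizes, using synthetic data with a few injected outliers for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: two dense groups plus a few scattered outliers.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (45, 2)),
               rng.uniform(-10, 10, (5, 2))])
Z = linkage(X, method="ward")

for k in range(2, 8):
    labels = fcluster(Z, t=k, criterion="maxclust")
    sizes = np.bincount(labels)[1:]          # fcluster labels start at 1
    print(f"k={k}: sizes = {sorted(sizes.tolist(), reverse=True)}")
# If a candidate k mostly adds tiny clusters (a few points each), prefer a smaller k
# or treat those tiny clusters as potential outliers.
```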
By considering cluster sizes alongside WCSS during the elbow analysis, you ensure that the chosen k produces a meaningful, interpretable, and balanced clustering solution, avoiding pitfalls like over-segmentation or the creation of irrelevant clusters.