Clustering Flashcards

1
Q

What is Hierarchical clustering?

A

By far, one of the most common clustering techniques
Produces a hierarchy (dendrogram) of nested clusters that can be analyzed and visualized

2
Q

What are the approaches to implement hierarchical clustering?

A

• Agglomerative
– Start with each point as an individual cluster
– At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left

• Divisive
– Start with one cluster containing all the data points
– At each step, split a cluster until each cluster contains a single point (or there are k clusters)

3
Q

What is the most used approach to build hierarchical clusters and what is the key operation behind it?

A

Agglomerative hierarchical clustering is the most popular hierarchical clustering technique

The key operation is the computation of the proximity (distance) between two clusters

There are different approaches to defining the distance between clusters (the linkage methods)

4
Q

How can we calculate the distance between different clusters?

A

When performing hierarchical clustering, the distance between clusters is determined by a linkage method. The choice of linkage method affects the structure of the resulting dendrogram and depends on the nature of the data and the clustering objective. Here are the main methods for calculating distances between clusters:

1. Single Linkage (Nearest Neighbor)
• Definition: The distance between two clusters is defined as the minimum distance between any pair of points in the two clusters.

d(\text{Cluster}_1, \text{Cluster}_2) = \min_{x \in \text{Cluster}_1,\ y \in \text{Cluster}_2} d(x, y)

•	Characteristics:
•	Creates elongated or “chained” clusters.
•	Sensitive to noise and outliers because they can link distant points.
•	Use case: When you expect clusters to have an irregular or elongated shape.
2. Complete Linkage (Farthest Neighbor)
• Definition: The distance between two clusters is defined as the maximum distance between any pair of points in the two clusters.

d(\text{Cluster}_1, \text{Cluster}_2) = \max_{x \in \text{Cluster}_1,\ y \in \text{Cluster}_2} d(x, y)

•	Characteristics:
•	Tends to create compact and spherical clusters.
•	Less sensitive to noise compared to single linkage.
•	Use case: When you want compact clusters and can tolerate splitting elongated ones.
3. Average Linkage
• Definition: The distance between two clusters is the average distance between all pairs of points (one from each cluster).

d(\text{Cluster}_1, \text{Cluster}_2) = \frac{1}{|C_1| \cdot |C_2|} \sum_{x \in \text{Cluster}_1} \sum_{y \in \text{Cluster}_2} d(x, y)

•	Characteristics:
•	Balances the compactness of clusters and their proximity.
•	Produces moderate cluster shapes.
•	Use case: General-purpose clustering when you want a balance between compactness and flexibility.
4. Centroid Linkage
• Definition: The distance between two clusters is the distance between their centroids (mean points).

d(\text{Cluster}_1, \text{Cluster}_2) = d(\text{Centroid}_1, \text{Centroid}_2)

•	Characteristics:
•	Sensitive to outliers because centroids are affected by extreme points.
•	May produce non-hierarchical structures if centroids shift and cause cluster inversions.
•	Use case: When clusters are expected to be approximately spherical and of similar size.
5. Ward’s Linkage (Minimum Variance Method)
• Definition: The distance between two clusters is the increase in total within-cluster variance (error sum of squares, ESS) that results from merging them; the method minimizes this increase at each step.

d(\text{Cluster}_1, \text{Cluster}_2) = \Delta \text{Error Sum of Squares (ESS)}

•	Characteristics:
•	Produces compact and spherical clusters.
•	Less sensitive to outliers compared to centroid linkage.
•	Use case: When reducing within-cluster variance is a priority, often used in practice.

Key Considerations for Choosing a Method

1.	Data Shape:
•	Irregular/elongated: Use single linkage.
•	Compact/spherical: Use complete or Ward’s linkage.
2.	Sensitivity to Outliers:
•	More sensitive: Single and centroid linkage.
•	Less sensitive: Complete and Ward’s linkage.
3.	Cluster Sizes:
•	Similar size: Ward’s or centroid linkage.
•	Different sizes: Single or average linkage.

Example Workflow:

1.	Compute a distance matrix (e.g., Euclidean, Manhattan).
2.	Select a linkage method.
3.	Merge clusters iteratively based on the selected distance metric.
4.	Visualize the clustering hierarchy using a dendrogram.
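
As a rough illustration of this workflow, a minimal SciPy sketch (assuming Euclidean distances, Ward linkage, and a hypothetical toy array X):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                     # hypothetical 2-D data points
dists = pdist(X, metric="euclidean")             # 1. condensed distance matrix
Z = linkage(dists, method="ward")                # 2./3. iteratively merge the closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the hierarchy into 3 flat clusters
dendrogram(Z)                                    # 4. visualize the hierarchy (needs matplotlib)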
5
Q

Detail the distance calculation using Ward’s method

A

Ward’s Method: Distance Calculation in Hierarchical Clustering

Ward’s method, also known as the minimum variance method, is a hierarchical clustering technique that minimizes the increase in the total within-cluster variance (error sum of squares, ESS) when merging two clusters. This method produces compact and spherical clusters and is one of the most commonly used linkage methods.

Key Idea:

When two clusters A and B are merged into a new cluster C , the distance between A and B is determined by the increase in the total within-cluster variance (or ESS).

Formula for Ward’s Distance:

The Ward’s distance d(A, B) between two clusters A and B is given by:

d(A, B) = \frac{|A|\,|B|}{|A| + |B|} \|\mu_A - \mu_B\|^2

Where:
• |A| and |B| : Number of data points in clusters A and B , respectively.
• \mu_A and \mu_B : Centroids (mean vectors) of clusters A and B , respectively.
• | \mu_A - \mu_B |^2 : Squared Euclidean distance between the centroids of clusters A and B .

Steps to Calculate Ward’s Distance:

1.	Compute the Centroids:
•	For cluster  A , calculate:

\mu_A = \frac{1}{|A|} \sum_{x \in A} x

•	Similarly, for cluster  B , calculate:

\mu_B = \frac{1}{|B|} \sum_{x \in B} x

2.	Calculate the Squared Euclidean Distance Between Centroids:

\|\mu_A - \mu_B\|^2 = \sum_{i=1}^{d} (\mu_{A,i} - \mu_{B,i})^2

Here, d is the number of dimensions in the data.
3. Scale by Cluster Sizes:
• Multiply by the product of the cluster sizes ( |A||B| ).
• Divide by the total size of the merged cluster ( |A| + |B| ).
4. Result:
The resulting distance represents the increase in within-cluster variance if clusters A and B are merged.

Why Use Ward’s Method?

Ward’s method explicitly minimizes the total within-cluster variance at each step. This ensures:
• Clusters are compact (low variance).
• The clustering process is stable, avoiding early chaining effects seen in methods like single linkage.

Example:

Given:

•	Cluster  A  contains  |A| = 3  points with centroid  \mu_A = (2, 3) .
•	Cluster  B  contains  |B| = 2  points with centroid  \mu_B = (4, 6) .

Step 1: Compute | \mu_A - \mu_B |^2 :

| \mu_A - \mu_B |^2 = (2 - 4)^2 + (3 - 6)^2 = 4 + 9 = 13

Step 2: Compute d(A, B) :

d(A, B) = \frac{|A||B|}{|A| + |B|} | \mu_A - \mu_B |^2

Substitute values:

d(A, B) = \frac{3 \cdot 2}{3 + 2} \cdot 13 = \frac{6}{5} \cdot 13 = 15.6

Thus, the Ward’s distance between clusters A and B is 15.6.

Summary of Ward’s Method:

•	Focus: Minimizing the increase in within-cluster variance at each merge.
•	Key Calculation: Combines cluster sizes and centroid distances to balance compactness and representativeness.
•	Use Case: Works well for compact, spherical clusters and is robust to outliers.
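
As a quick sanity check of this worked example, a tiny Python sketch (the helper name is illustrative) reproduces the value 15.6:

import numpy as np

def ward_distance(size_a, mu_a, size_b, mu_b):
    # Increase in within-cluster variance (ESS) if clusters A and B are merged
    mu_a, mu_b = np.asarray(mu_a, dtype=float), np.asarray(mu_b, dtype=float)
    return size_a * size_b / (size_a + size_b) * np.sum((mu_a - mu_b) ** 2)

print(ward_distance(3, (2, 3), 2, (4, 6)))  # 15.6, matching the example above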

6
Q

Detail the distance calculation using the Lance-Williams method

A

Lance-Williams Method: Distance Calculation in Hierarchical Clustering

The Lance-Williams formula is a general and flexible framework used to compute distances between clusters during hierarchical clustering. It supports a variety of linkage methods (e.g., single, complete, average, Ward’s) by defining a unified formula that updates the distance matrix as clusters are merged.

Formula for the Lance-Williams Update:

When two clusters A and B are merged into a new cluster C , the distance between C and any other cluster D is given by:

d(C, D) = \alpha_A d(A, D) + \alpha_B d(B, D) + \beta d(A, B) + \gamma |d(A, D) - d(B, D)|

Where:
• d(A, D) : Distance between cluster A and cluster D .
• d(B, D) : Distance between cluster B and cluster D .
• d(A, B) : Distance between the merged clusters A and B .
• \alpha_A and \alpha_B : Weighting coefficients for d(A, D) and d(B, D) .
• \beta : Weighting factor for d(A, B) .
• \gamma : Weighting factor for the absolute difference |d(A, D) - d(B, D)| .

The values of \alpha_A , \alpha_B , \beta , and \gamma depend on the linkage method being used.

Special Cases of Linkage Methods Using Lance-Williams:

1. Single Linkage (Nearest Neighbor):
• \alpha_A = \frac{1}{2}, \, \alpha_B = \frac{1}{2}, \, \beta = 0, \, \gamma = -\frac{1}{2}

d(C, D) = \frac{1}{2} d(A, D) + \frac{1}{2} d(B, D) - \frac{1}{2} |d(A, D) - d(B, D)|

•	Distance is based on the minimum distance between any two points from the two clusters.
2. Complete Linkage (Farthest Neighbor):
• \alpha_A = \frac{1}{2}, \, \alpha_B = \frac{1}{2}, \, \beta = 0, \, \gamma = \frac{1}{2}

d(C, D) = \frac{1}{2} d(A, D) + \frac{1}{2} d(B, D) + \frac{1}{2} |d(A, D) - d(B, D)|

•	Distance is based on the maximum distance between any two points from the two clusters.
3. Average Linkage:
• \alpha_A = \frac{|A|}{|A| + |B|}, \, \alpha_B = \frac{|B|}{|A| + |B|}, \, \beta = 0, \, \gamma = 0

d(C, D) = \frac{|A|}{|A| + |B|} d(A, D) + \frac{|B|}{|A| + |B|} d(B, D)

•	Distance is the weighted average of the distances from  D  to  A  and  B , based on the cluster sizes.
4. Centroid Linkage:
• \alpha_A = \frac{|A|}{|A| + |B|}, \, \alpha_B = \frac{|B|}{|A| + |B|}, \, \beta = -\frac{|A| \cdot |B|}{(|A| + |B|)^2}, \, \gamma = 0

d(C, D) = \frac{|A|}{|A| + |B|} d(A, D) + \frac{|B|}{|A| + |B|} d(B, D) - \frac{|A| \cdot |B|}{(|A| + |B|)^2} d(A, B)

•	Distance is based on the centroids of the clusters and their relative sizes.
5. Ward’s Method:
• \alpha_A = \frac{|A| + |D|}{|A| + |B| + |D|}, \, \alpha_B = \frac{|B| + |D|}{|A| + |B| + |D|}, \, \beta = -\frac{|D|}{|A| + |B| + |D|}, \, \gamma = 0

d(C, D) = \frac{|A| + |D|}{|A| + |B| + |D|} d(A, D) + \frac{|B| + |D|}{|A| + |B| + |D|} d(B, D) - \frac{|D|}{|A| + |B| + |D|} d(A, B)

•	Distance is determined by minimizing the total within-cluster variance.

Key Features of Lance-Williams Method:

1.	Flexibility: By adjusting the coefficients ( \alpha_A, \alpha_B, \beta, \gamma ), the method can implement different linkage strategies.
2.	Efficiency: Allows dynamic updating of the distance matrix without recalculating pairwise distances, reducing computational cost.
3.	Unified Framework: All popular hierarchical clustering methods can be expressed using this formula.

Workflow Using Lance-Williams:

1.	Start with a distance matrix for individual points.
2.	Select a linkage method (e.g., single, complete, Ward’s).
3.	Use the Lance-Williams formula to compute distances when merging clusters.
4.	Repeat until all points are merged into a single cluster.

The Lance-Williams method streamlines hierarchical clustering by offering a single, adaptable framework for calculating inter-cluster distances.
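
As a small Python sketch (with illustrative names), the update formula and two of its special cases:

def lance_williams(d_AD, d_BD, d_AB, alpha_A, alpha_B, beta=0.0, gamma=0.0):
    # Generic Lance-Williams update for d(C, D) after merging A and B into C
    return alpha_A * d_AD + alpha_B * d_BD + beta * d_AB + gamma * abs(d_AD - d_BD)

# Single-linkage coefficients reduce the update to min(d_AD, d_BD):
print(lance_williams(2.0, 5.0, 3.0, alpha_A=0.5, alpha_B=0.5, gamma=-0.5))  # 2.0
# Complete-linkage coefficients reduce it to max(d_AD, d_BD):
print(lance_williams(2.0, 5.0, 3.0, alpha_A=0.5, alpha_B=0.5, gamma=0.5))   # 5.0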

7
Q

Refer to image DM2. What are the results of merging the closest clusters?

A

Answer in the same image.

8
Q

What is the complexity of implementing an ordinary hierarchical clustering?

A

• O(N²) space, since it uses the proximity matrix (N is the number of points)
• There are N steps, and at each step the N × N proximity matrix must be updated and searched: O(N³) time in many cases
• Complexity can be reduced to O(N² log N) time for some approaches

9
Q

How could we implement the construction of hierarchical clustering in a more efficient way?

A

• Compute the distance between all pairs of points [O(N²)]
• Insert the pairs and their distances into a priority queue to find the minimum in one step [O(N²)]
• When two clusters are merged, remove all entries in the priority queue involving either of these two clusters [O(N log N)]
• Compute all the distances between the new cluster and the remaining clusters [O(N log N)]
• Since the last two steps are executed at most N times, the overall complexity is O(N² log N)

10
Q

Which partition generated by hierarchical clustering should we choose?

A

Ideally, a good clustering should partition the data points so that:
Data points in the same cluster have a small distance from one another
Data points in different clusters are at a large distance from one another

11
Q

How can we evaluate the quality of a clustering solution?

A

• Internal Validation Measures
– Employ criteria that are derived from the data itself
– For instance, intracluster and intercluster distances to measure cluster cohesion and separation
– Cohesion evaluates how similar the points in the same cluster are
– Separation evaluates how far apart the points in different clusters are

• External Validation Measures
– Use prior or expert-specified knowledge about the clusters
– For example, we cluster the Iris data using the four input variables, then we evaluate the clustering using known class labels
– Employs criteria that are not inherent to the dataset

• Relative Validation Measures
– Aim to directly compare different solutions, usually those obtained via different parameter settings for the same algorithm.

12
Q

Detail how to calculate the internal measures of quality of a cluster called cohesion and separation

A

Internal Measures of Cluster Quality: Cohesion and Separation

Cohesion and separation are commonly used internal cluster validation metrics that measure the quality of clusters based on the compactness (cohesion) and distinctiveness (separation) of clusters. These metrics do not require ground truth labels and are computed directly from the dataset.

  1. Cohesion: Intra-Cluster Compactness

Definition:
Cohesion measures how closely the points within a cluster are related to each other. It reflects the compactness of a cluster: with the distance-based measures below, lower cohesion values indicate a more compact, better-formed cluster.

Formula:
For a cluster C_i , cohesion can be calculated as the sum of the pairwise distances between all points in the cluster or the sum of distances to the centroid:
1. Sum of Pairwise Distances:

\text{Cohesion}(C_i) = \sum_{x, y \in C_i} d(x, y)

Where:
• x, y : Data points in cluster C_i .
• d(x, y) : Distance between points x and y (e.g., Euclidean distance).
2. Sum of Distances to Centroid:

\text{Cohesion}(C_i) = \sum_{x \in C_i} d(x, \mu_i)

Where:
• \mu_i : Centroid (mean) of cluster C_i .
• d(x, \mu_i) : Distance between point x and the cluster centroid.

Interpretation:
• Lower cohesion values indicate a more compact cluster.
• Used to evaluate if points within clusters are close to each other.

2. Separation: Inter-Cluster Distinctiveness

Definition:
Separation measures how distinct or far apart different clusters are from each other. It reflects the distance between clusters: clusters with high separation are well-separated.

Formula:
For clusters C_i and C_j , separation can be calculated as:
1. Centroid-Based Distance:

\text{Separation}(C_i, C_j) = d(\mu_i, \mu_j)

Where:
• \mu_i , \mu_j : Centroids of clusters C_i and C_j .
• d(\mu_i, \mu_j) : Distance between the centroids (e.g., Euclidean distance).
2. Minimum Distance (Single Linkage):

\text{Separation}(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)

3.	Maximum Distance (Complete Linkage):

\text{Separation}(C_i, C_j) = \max_{x \in C_i, y \in C_j} d(x, y)

4.	Average Distance:

\text{Separation}(C_i, C_j) = \frac{1}{|C_i| |C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)

Overall Separation:
If there are multiple clusters, separation can be averaged across all cluster pairs:

\text{Overall Separation} = \frac{1}{\binom{k}{2}} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \text{Separation}(C_i, C_j)

Where k is the total number of clusters.

Interpretation:
• Higher separation values indicate well-separated clusters.
• Used to evaluate if clusters are sufficiently distinct from one another.

Combining Cohesion and Separation

A good clustering solution exhibits low cohesion (compact clusters) and high separation (distinct clusters). These metrics can be combined into a single measure, such as the Silhouette Coefficient:

s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

Where:
• a(i) : Cohesion of a point i , the average distance to other points in the same cluster.
• b(i) : Separation of a point i , the average distance to points in the nearest neighboring cluster.

The Silhouette Coefficient ranges from -1 to 1, where:
• Values close to 1: Well-clustered data.
• Values close to 0: Overlapping clusters.
• Values close to -1: Poorly clustered data.

Practical Use

1.	Cohesion:
•	Use the centroid-based calculation for large datasets for efficiency.
•	Use pairwise distance calculations for smaller datasets or when fine-grained analysis is needed.
2.	Separation:
•	Centroid-based calculations are sufficient for most clustering applications.
•	Minimum or maximum distance is useful for linkage-specific validations.

These measures guide the choice of clustering algorithms and validate results to ensure that clusters are meaningful and well-separated.
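
A minimal Python sketch of these measures (centroid-based cohesion, centroid-based separation, and the silhouette score, on a hypothetical blob dataset):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
centroids = np.array([X[labels == k].mean(axis=0) for k in range(3)])

# Cohesion: sum of distances from each point to its own centroid (lower = more compact)
cohesion = sum(np.linalg.norm(X[labels == k] - centroids[k], axis=1).sum() for k in range(3))
# Separation: average centroid-to-centroid distance (higher = more distinct)
pairs = [np.linalg.norm(centroids[i] - centroids[j]) for i in range(3) for j in range(i + 1, 3)]
separation = float(np.mean(pairs))
print(cohesion, separation, silhouette_score(X, labels))  # silhouette combines both views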

13
Q

What is the knee/elbow analysis? How can we use it?

A

The knee/elbow method is a heuristic technique used to determine the optimal number of clusters ( k ) in clustering algorithms

Evaluation of Hierarchical Clustering using Knee/Elbow Analysis

Plot the cohesion and separation for every clustering

Look for a knee/elbow in the plot, i.e., a point where the evaluation metrics show a significant change
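
A minimal elbow-plot sketch using the k-means within-cluster sum of squares (a cohesion measure) over a range of k values, assuming scikit-learn and matplotlib on hypothetical toy data:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares (cohesion)")
plt.show()  # look for the k where the curve bends sharply (the knee/elbow)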

14
Q

When working with clustering, you should look at several candidate solutions and at the coherence between them

A
15
Q

What is a way to improve the knee/elbow analysis?

A

We could scan the different configurations proposed and count/evaluate the size of the clusters

Analyzing cluster size while performing the knee/elbow analysis is a good practice because it helps assess the distribution of data points among clusters, ensuring that the chosen number of clusters ( k ) produces a meaningful and interpretable partitioning of the data. Here’s why this is important:

1. Avoids Small or Unbalanced Clusters
• When k increases, it can lead to small or unbalanced clusters, where some clusters have very few data points.
• Such clusters may not represent meaningful patterns in the data and can result from overfitting the clustering process.

Why it matters:
• Clustering aims to group data meaningfully. Large discrepancies in cluster sizes may indicate that the clustering result is dominated by noise or that k is too large.

2. Prevents Over-Segmentation
• Increasing k often reduces the WCSS (within-cluster sum of squares) because more clusters better approximate the data distribution.
• However, an excessive k can lead to over-segmentation, splitting meaningful clusters into smaller, redundant groups.

Why it matters:
• Analyzing cluster sizes ensures that clusters remain large enough to capture real structure in the data rather than splitting data arbitrarily to minimize WCSS.

3. Enhances Interpretability
• Clusters that are too small or too many are harder to interpret and may not align with the goals of the analysis.
• By analyzing cluster sizes, you can identify whether the chosen k produces clusters that are practically useful for your application.

Why it matters:
• For instance, in customer segmentation, having clusters with only a handful of customers might not provide actionable insights.

4. Identifies Potential Noise or Outliers
• Very small clusters can indicate the presence of noise or outliers in the data.
• Observing cluster sizes during the elbow analysis can help identify these anomalies and guide decisions on whether to preprocess the data further.

Why it matters:
• Removing or separately analyzing outliers can lead to more robust clustering results.

5. Supports Better Validation
• Internal validation measures like cohesion and separation, or external validation metrics, can be misleading if some clusters have disproportionately small or large sizes.
• Analyzing cluster sizes provides additional context to validate whether the chosen k produces a balanced and coherent solution.

6. Helps Fine-Tune k Selection
• The elbow point may not always correspond to the “best” k in terms of application-specific needs.
• By considering cluster sizes, you can refine the choice of k to strike a balance between compact clusters (low WCSS) and practical utility (balanced sizes).

Example of Why Cluster Size Matters:

Scenario:

Suppose you are performing k-means clustering to segment customers.
1. Without Analyzing Cluster Sizes:
• Elbow analysis suggests k = 6 .
• However, analyzing the cluster sizes reveals that one cluster has 95% of the data, while the other five clusters are tiny, containing only a few customers each.
2. With Analyzing Cluster Sizes:
• Adjust k to a value where the clusters are more balanced (e.g., k = 3 ), providing more meaningful groupings of customers.

Best Practices for Analyzing Cluster Sizes:

1.	Plot Cluster Sizes:
•	Visualize cluster sizes for different  k  (e.g., bar chart or histogram).
•	Look for balanced distributions or detect extreme outliers.
2.	Evaluate Distribution:
•	Avoid clusters that are disproportionately large or small unless justified by the data’s structure.
3.	Combine with Other Metrics:
•	Use cohesion, separation, or the Silhouette Score to assess whether clusters are compact and well-separated.
4.	Domain Knowledge:
•	Leverage domain expertise to determine whether the cluster sizes align with expectations and practical goals.

By considering cluster sizes alongside WCSS during the elbow analysis, you ensure that the chosen k produces a meaningful, interpretable, and balanced clustering solution, avoiding pitfalls like over-segmentation or the creation of irrelevant clusters.
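
A small sketch of this size check alongside the elbow scan (hypothetical toy data):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sizes = np.bincount(labels, minlength=k)
    print(f"k={k}: cluster sizes = {sizes.tolist()}")  # flag tiny or dominant clusters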

16
Q

What is cluster stability? How can we calculate it?

A

Cluster Stability

Cluster stability refers to the consistency or robustness of a clustering solution across different variations of the data or algorithm parameters. It measures how much the clusters remain unchanged when the data is perturbed or the clustering algorithm is re-run with slight modifications.

Stable clusters are desirable as they indicate that the clustering results are not overly sensitive to noise, random initialization, or other small variations in the data or algorithm.

Why Cluster Stability Matters

1.	Reliability: Stability ensures that the clustering reflects inherent patterns in the data rather than artifacts of the algorithm or data preprocessing.
2.	Model Selection: Stability can help select the optimal number of clusters (k), complementing metrics like cohesion, separation, and elbow analysis.
3.	Interpretability: Stable clusters are easier to interpret and more likely to generalize well to unseen data.

How to Measure Cluster Stability

Cluster stability is often measured by comparing clustering results across multiple runs or perturbations. Below are common approaches:

  1. Stability Across Repeated Runs (Resampling)

This involves clustering the data multiple times and comparing the resulting cluster assignments. The steps are:
1. Resampling:
• Perturb the dataset by adding noise, removing a subset of data, or sampling data points randomly.
• Repeat the clustering process N times.
2. Comparison:
• Compare the cluster labels obtained in each run using measures like:
• Adjusted Rand Index (ARI): Quantifies the similarity between two clusterings by comparing their pairwise label agreements.
• Normalized Mutual Information (NMI): Measures the mutual dependence between two cluster assignments.
3. Aggregate Stability:
• Compute the average similarity across all pairs of runs to quantify stability:

\text{Stability} = \frac{1}{\binom{N}{2}} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \text{Similarity}(\text{Clustering}_i, \text{Clustering}_j)

Interpretation:
• Higher values of ARI or NMI indicate greater stability.

2. Stability Under Data Perturbation
1. Perturb the Data:
• Apply techniques like:
  • Adding random noise to features.
  • Bootstrapping (resampling the data with replacement).
  • Removing a subset of the data.
2. Cluster the Perturbed Data:
• Perform clustering on each perturbed version.
3. Compare Assignments:
• Use ARI, NMI, or another similarity metric to compare cluster labels between the original and perturbed datasets.

3. Stability Across Parameter Changes

This approach evaluates how sensitive the clustering is to changes in algorithm parameters. For example:
• For k-means, vary the number of clusters k or the random initialization.
• For hierarchical clustering, vary the linkage method (e.g., single, complete, or average linkage).

Steps:
1. Perform clustering with different parameter settings.
2. Compare cluster assignments using ARI, NMI, or Jaccard similarity.
3. Analyze whether the clusters remain consistent.

4. Prediction Strength

Prediction strength assesses whether clusters identified in one subset of the data can be reproduced in another subset. Steps:
1. Split the dataset into two halves.
2. Perform clustering on the first half and assign the second-half points to the clusters (using proximity to centroids, for example).
3. Measure agreement between the original clustering and the predicted clustering for the second half.

Prediction strength is calculated as:

\text{Prediction Strength} = \frac{\text{Number of consistent assignments}}{\text{Total number of assignments}}

Interpretation:
• Higher prediction strength indicates better stability.

5. Visualization-Based Methods

Visualizing cluster stability can offer intuitive insights:
1. Consensus Matrix:
• Construct a matrix that records how often two points are assigned to the same cluster across multiple runs.
• Visualize this matrix as a heatmap to assess the consistency of point assignments.
2. Silhouette Analysis:
• Use silhouette scores to visualize and compare cluster consistency across different runs.

Practical Example of Cluster Stability Measurement

1.	Resampling Stability:
•	Generate 100 bootstrap samples from your dataset.
•	Perform clustering on each sample.
•	Use ARI or NMI to compare cluster assignments across all pairs of samples.
•	Compute the average ARI or NMI score as the stability measure.
2.	Parameter Stability:
•	Run k-means for k = 3, 4, 5, \dots, 10, with 10 random initializations for each k.
•	For each k, compute the variance of ARI or NMI scores across runs.
•	Select k where stability (low variance) is maximized.

When to Use Cluster Stability

•	When choosing the optimal  k .
•	To validate whether clustering results are reliable.
•	In applications where interpretability or robustness is critical.

Limitations of Cluster Stability

1.	Computationally expensive, especially for large datasets.
2.	Sensitive to the type of perturbation or resampling method.
3.	Stability does not always imply meaningful clusters; it must be used alongside external measures or domain expertise.

By quantifying the robustness of clustering results, cluster stability provides insights into the reliability of the discovered patterns and aids in model selection and validation.
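
A minimal Python sketch of resampling-based stability, comparing k-means labelings across random subsamples with the Adjusted Rand Index (the helper name and the 80% subsampling fraction are illustrative choices):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)
rng = np.random.default_rng(0)

def subsample_stability(X, k, n_runs=20, frac=0.8):
    # Average pairwise ARI of k-means labelings over random subsamples,
    # compared on the points each pair of subsamples has in common.
    n = len(X)
    runs = []
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx])
        runs.append(dict(zip(idx.tolist(), labels.tolist())))
    scores = []
    for a in range(n_runs):
        for b in range(a + 1, n_runs):
            common = sorted(set(runs[a]) & set(runs[b]))
            scores.append(adjusted_rand_score([runs[a][i] for i in common],
                                              [runs[b][i] for i in common]))
    return float(np.mean(scores))

for k in (2, 3, 4, 5):
    print(k, round(subsample_stability(X, k), 3))  # the most stable k scores closest to 1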

17
Q

What is the difference between hierarchical clustering and representative-based clustering?

A

Hierarchical Clustering vs. Representative-Based Clustering

Hierarchical clustering and representative-based clustering (e.g., k-means) are two distinct types of clustering algorithms, each with its own approach, characteristics, and use cases. Here’s a detailed comparison:

  1. Hierarchical Clustering

Hierarchical clustering builds a hierarchy of clusters, either by merging smaller clusters into larger ones (agglomerative) or splitting larger clusters into smaller ones (divisive).

Characteristics:

•	Structure: Produces a dendrogram (tree-like structure), representing nested clusters at different levels.
•	Algorithm Types:
•	Agglomerative (bottom-up): Starts with individual data points as clusters and merges them iteratively based on distance until a single cluster forms.
•	Divisive (top-down): Starts with one cluster containing all data points and splits it iteratively into smaller clusters.
•	Distance Measures: Relies on distance metrics (e.g., Euclidean, Manhattan) and linkage criteria (e.g., single, complete, average) to determine cluster similarity.
•	Deterministic: Results are deterministic, meaning the output is the same every time for the same input and parameters.

Advantages:

1.	No need to specify the number of clusters in advance.
2.	Works well for hierarchical or nested relationships in the data.
3.	Can capture non-convex clusters.
4.	Provides a visual representation (dendrogram) to analyze cluster formations at different levels.

Disadvantages:

1.	Computationally expensive (time complexity:  O(n^2 \log n)  or higher).
2.	Sensitive to noise and outliers.
3.	Cannot handle large datasets efficiently.

Use Cases:

1.	Taxonomy or classification of species.
2.	Document clustering for topic modeling.
3.	Market segmentation where hierarchical relationships matter.

2. Representative-Based Clustering (e.g., k-means, k-medoids)

Representative-based clustering relies on representatives (centroids or medoids) to partition the dataset into a pre-specified number of clusters.

Characteristics:

•	Structure: Divides the data into non-overlapping flat clusters.
•	Algorithm: Assigns points to the nearest representative, then updates representatives iteratively to minimize a specific objective function (e.g., within-cluster sum of squares for k-means).
•	Cluster Shape: Works best for spherical, convex, or isotropic clusters.
•	Non-deterministic: May yield different results due to random initialization of centroids (unless initialization methods like k-means++ are used).

Advantages:

1.	Computationally efficient (time complexity:  O(n \cdot k \cdot i) , where  i  is the number of iterations).
2.	Scales well to large datasets.
3.	Easy to implement and interpret.
4.	Effective for compact and well-separated clusters.

Disadvantages:

1.	Requires specifying the number of clusters ( k ) in advance.
2.	Sensitive to initialization (e.g., initial centroids in k-means).
3.	Poor performance on clusters of varying sizes, densities, or non-spherical shapes.
4.	Cannot capture hierarchical relationships.

Use Cases:

1.	Customer segmentation for marketing campaigns.
2.	Image compression and vector quantization.
3.	Grouping similar items in recommendation systems.

Key Differences:

• Cluster Formation: Hierarchical builds a hierarchy of clusters; representative-based partitions the data into k flat clusters.
• Number of Clusters: Hierarchical does not require pre-specifying k; representative-based requires pre-defining the number of clusters k.
• Output: Hierarchical produces a dendrogram (tree structure); representative-based produces non-overlapping flat clusters.
• Cluster Shape: Hierarchical captures arbitrary-shaped clusters; representative-based assumes convex, isotropic clusters (e.g., spherical).
• Computational Efficiency: Hierarchical is slower, especially for large datasets; representative-based is faster and scalable to large datasets.
• Sensitivity to Noise: Hierarchical is sensitive to noise and outliers; representative-based is more robust if preprocessing is done carefully.
• Interpretability: Hierarchical provides insights at multiple levels of granularity; representative-based directly partitions data into simpler structures.
• Scalability: Hierarchical is inefficient for large datasets (O(n²) or worse); representative-based scales well (roughly O(n)).
• Determinism: Hierarchical is deterministic (same output for same input); representative-based is non-deterministic unless initialization is fixed.

When to Use Which?

•	Hierarchical Clustering:
•	Use when you need to explore the hierarchical structure of the data.
•	Suitable for smaller datasets or when the number of clusters is not known beforehand.
•	Ideal for data with nested or tree-like relationships.
•	Representative-Based Clustering:
•	Use when the dataset is large and requires efficient clustering.
•	Suitable for datasets with clear, compact, spherical clusters.
•	Works well when the number of clusters ( k ) is pre-determined or can be estimated using methods like elbow analysis.

Both methods serve different purposes, and the choice depends on the dataset’s characteristics, the desired output, and computational constraints.

18
Q

How can we use the k-means algorithm to calculate representative-based clusters? What are its advantages and limitations?

A

Using the k-Means Algorithm for Representative-Based Clusters

The k-means algorithm is one of the most widely used methods for creating representative-based clusters. It partitions the data into k clusters by minimizing the variance within each cluster, represented by its centroid. Here’s how it works, along with its advantages and limitations.

Steps to Calculate Representative-Based Clusters Using k-Means

1.	Initialization:
•	Choose  k , the number of clusters.
•	Randomly initialize  k  centroids (or use smarter initializations like k-means++ for better results).
2.	Assignment Step:
•	Assign each data point to the nearest cluster centroid based on a distance metric, typically Euclidean distance:

\text{Cluster}(x_i) = \arg\min_j \|x_i - c_j\|^2

Where:
• x_i : Data point.
• c_j : Centroid of cluster j .
3. Update Step:
• Recalculate the centroid of each cluster by taking the mean of all data points assigned to it:

c_j = \frac{1}{|C_j|} \sum_{x \in C_j} x

Where C_j is the set of points in cluster j .
4. Repeat:
• Alternate between the assignment and update steps until:
• Cluster assignments no longer change, or
• A predefined maximum number of iterations is reached.
5. Output:
• The final k clusters, each represented by its centroid.
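
A minimal from-scratch sketch of these alternating steps (Lloyd's algorithm) in NumPy; the function and variable names are illustrative, not a reference implementation:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]          # random initialization
    for _ in range(n_iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                     # converged
            break
        centroids = new_centroids
    return labels, centroids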

Advantages of k-Means

1.	Efficiency:
•	Computationally efficient with a time complexity of  O(n \cdot k \cdot i) , where:
•	 n : Number of data points.
•	 k : Number of clusters.
•	 i : Number of iterations.
•	Scales well to large datasets.
2.	Simplicity:
•	Easy to implement and understand.
•	Straightforward interpretation: Each cluster is represented by its centroid.
3.	Works Well for Convex Clusters:
•	Performs well when clusters are compact, spherical, and well-separated.
4.	Flexibility with Distance Metrics:
•	While traditionally using Euclidean distance, k-means can adapt to other metrics with slight modifications.
5.	Widely Used Applications:
•	Customer segmentation, image compression, document clustering, and more.

Limitations of k-Means

1.	Assumes Spherical Clusters:
•	k-means performs poorly when clusters are non-convex, non-spherical, or vary in size or density.
2.	Sensitivity to Initialization:
•	Results depend on the initial placement of centroids. Poor initialization can lead to suboptimal clustering or convergence to local minima.
•	Solution: Use improved methods like k-means++ for better initial centroids.
3.	Requires  k  to Be Pre-Specified:
•	The number of clusters ( k ) must be determined beforehand, which can be challenging without prior knowledge.
•	Solution: Use techniques like elbow method or silhouette score to estimate  k .
4.	Outliers Impact Results:
•	Outliers can significantly skew centroids and clustering results.
•	Solution: Preprocess the data to handle outliers.
5.	Sensitive to Scale:
•	Features with different scales can dominate clustering results.
•	Solution: Standardize or normalize the data before applying k-means.
6.	Non-Deterministic:
•	k-means may produce different results for the same dataset due to random initialization (unless initialization is fixed).
7.	Poor Performance for High-Dimensional Data:
•	The curse of dimensionality affects Euclidean distances, making clusters less distinguishable in high dimensions.

Practical Example

Suppose you are clustering customer data for segmentation based on annual income and spending score. Here’s how you can apply k-means:
1. Preprocess the Data:
• Normalize the income and spending score features.
• Handle outliers if present.
2. Determine k :
• Use the elbow method or silhouette score to estimate the optimal number of clusters.
3. Apply k-Means:
• Run k-means with the chosen k .
• Use k-means++ for initialization to avoid suboptimal centroids.
4. Analyze the Results:
• Examine the centroids to interpret cluster characteristics (e.g., high-income vs. low-income groups).
• Visualize the clusters if the data is low-dimensional.

When to Use k-Means

Use k-means when:
• The clusters are roughly spherical and balanced in size.
• You have a clear estimate of k .
• You need an efficient and simple clustering method for a large dataset.

Avoid k-means when:
• Clusters have irregular shapes or are not well-separated.
• There are many outliers or the data is high-dimensional and sparse.

By being aware of its strengths and limitations, k-means can be an effective tool for creating representative-based clusters in a wide range of applications.

19
Q

What is the computation complexity of k-means

A

• Cluster assignment takes O(nkd) time since, for each of the n points, it computes its distance to each of the k clusters, which takes d operations in d dimensions
• The centroid re-computation step takes O(nd) time to add n d-dimensional points
• Assuming that there are t iterations, the total time for K-means is given as O(tnkd).
• In terms of the I/O cost it requires O(t) full database scans, because we have to read the entire database in each iteration.

20
Q

What are some of the possible centroid initializations for the k-mean algorithm?

A

• Solution 1
–Pick points that are as far away from one another as possible.

• Solution 2
Pick the first point at random;
WHILE there are fewer than k points DO
Add the point whose minimum distance from the selected points is as large as possible;
END;

• Solution 3
– Cluster a sample of the data, perhaps hierarchically, so there are k clusters. Pick a point from each cluster, perhaps that point closest to the centroid of the cluster.
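
Solution 2 made runnable as a short Python sketch (the function name is illustrative):

import numpy as np

def farthest_point_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]              # pick the first point at random
    while len(centers) < k:
        # minimum distance from each point to the already-selected centers
        d_min = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d_min))])     # add the point farthest from them all
    return np.array(centers)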

21
Q

What is a common problem that affects greedy algorithms and also affects k-means? How could we solve it? Propose an algorithm to do it!

A

Initialization of the centroids.

Some of the solutions could be:
• Multiple runs helps, but probability is not on your side
• Sample and use another clustering method (hierarchical?) to determine initial centroids
• Select more than k initial centroids and then select among these initial centroids
• Postprocessing
• Bisecting K-means, not as susceptible to initialization issues

22
Q

What is the bisecting k-means algorithm? What problems does it aim to solve?

A

Bisecting k-Means Algorithm

The bisecting k-means algorithm is a variant of the standard k-means clustering algorithm designed to address some of its limitations. It combines features of hierarchical clustering and k-means to produce a more balanced and efficient clustering solution.

How Bisecting k-Means Works

The algorithm iteratively splits the dataset into clusters using k-means in a top-down, hierarchical manner. Here’s the step-by-step process:
1. Start with All Data in One Cluster:
• Begin with the entire dataset treated as a single cluster.
2. Select a Cluster to Split:
• Choose the cluster to be split based on certain criteria, such as:
• The largest cluster (most common).
• The cluster with the highest variance or sum of squared distances.
3. Bisect the Cluster Using k-Means:
• Perform k-means with k = 2 on the selected cluster, dividing it into two sub-clusters.
• Repeat the k-means process M times (with different initializations), and keep the split that minimizes the total intra-cluster variance (or another objective function).
4. Repeat Until k Clusters Are Formed:
• Continue splitting clusters until the desired number of clusters ( k ) is reached.
5. Output:
• A set of k clusters, hierarchically derived through repeated bisections.
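
A compact Python sketch of this procedure, splitting the cluster with the largest SSE via repeated 2-means (names are illustrative; recent scikit-learn versions also ship a built-in BisectingKMeans estimator):

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, n_trials=5, seed=0):
    clusters = [np.arange(len(X))]                   # start with all points in one cluster
    while len(clusters) < k:
        # select the cluster with the highest within-cluster SSE to split next
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))
        best = None
        for t in range(n_trials):                    # rerun 2-means, keep the best split
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + t).fit(X[idx])
            if best is None or km.inertia_ < best.inertia_:
                best = km
        clusters.append(idx[best.labels_ == 0])
        clusters.append(idx[best.labels_ == 1])
    labels = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels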

Key Features of Bisecting k-Means

1.	Hierarchical Nature:
•	The process is hierarchical, with a top-down approach to cluster formation.
2.	Focus on Cluster Balance:
•	By iteratively splitting clusters, it tends to produce clusters of similar size, avoiding overly large or small clusters that can occur in standard k-means.
3.	Improved Robustness:
•	Re-running k-means multiple times at each split helps find a more stable and accurate division.

Problems Solved by Bisecting k-Means

1.	Sensitivity to Initialization:
•	Standard k-means is sensitive to initial centroid placement, which can lead to suboptimal clustering. Bisecting k-means mitigates this by focusing on one cluster at a time and using multiple runs for each split.
2.	Imbalanced Clusters:
•	Standard k-means may produce clusters of widely varying sizes. Bisecting k-means prioritizes splitting larger or more dispersed clusters, promoting more balanced results.
3.	Hierarchical Structure:
•	Unlike standard k-means, bisecting k-means produces a hierarchical clustering structure, making it suitable for datasets with inherent hierarchical relationships.
4.	Scalability:
•	Bisecting k-means is more scalable than traditional hierarchical clustering algorithms, which have a higher computational complexity.
5.	Interpretability:
•	The hierarchical nature allows for step-by-step analysis of cluster formation, aiding interpretability.

Advantages of Bisecting k-Means

1.	Combines Strengths of k-Means and Hierarchical Clustering:
•	Achieves efficient clustering with a hierarchical structure.
2.	Handles Large Datasets:
•	More scalable than traditional hierarchical clustering algorithms like agglomerative clustering.
3.	Reduced Sensitivity to Initialization:
•	Multiple iterations for splitting improve robustness.
4.	Balanced Clusters:
•	Produces more evenly distributed clusters compared to standard k-means.
5.	Customizable Stopping Criteria:
•	Flexibility to choose the number of clusters or stop based on other metrics (e.g., total variance reduction).

Limitations of Bisecting k-Means

1.	Still Assumes Convex Clusters:
•	Like standard k-means, it works best for spherical, well-separated clusters.
2.	Computational Cost:
•	Running k-means multiple times for each split can increase computational requirements, especially for large datasets.
3.	Arbitrary Selection of Splits:
•	Criteria for selecting which cluster to split (e.g., largest size or highest variance) can influence results and may not always lead to optimal splits.
4.	Pre-Specified  k :
•	The number of clusters ( k ) still needs to be specified or estimated beforehand.

Use Cases of Bisecting k-Means

1.	Document Clustering:
•	Often used for clustering text documents, where clusters may have hierarchical relationships.
2.	Customer Segmentation:
•	Useful in cases where customers need to be grouped into balanced, hierarchical segments.
3.	Hierarchical Analysis:
•	Ideal for datasets with a natural hierarchy or when hierarchical relationships are important for interpretation.

Comparison to Standard k-Means

• Approach: Standard k-means partitions the data into k clusters directly; bisecting k-means iteratively splits clusters into two sub-clusters.
• Output: Standard k-means gives flat clusters; bisecting k-means gives a hierarchical structure of clusters.
• Cluster Balance: Standard k-means can result in imbalanced clusters; bisecting k-means produces more balanced clusters.
• Initialization Sensitivity: Standard k-means is highly sensitive to initial centroids; bisecting k-means reduces sensitivity through multiple runs per split.
• Scalability: Standard k-means is highly scalable; bisecting k-means is slightly less scalable due to repeated splits.

Practical Example

Suppose you have a dataset of customer transactions and want to create 5 customer segments:
1. Start with all customers in one cluster.
2. Use k = 2 k-means to split the cluster into two segments (repeat M times for stability).
3. Select the largest or most dispersed cluster and split it again.
4. Repeat until 5 clusters are formed.

The resulting clusters will represent balanced customer groups, potentially revealing hierarchical relationships among them.

By combining hierarchical and representative-based clustering, bisecting k-means provides an effective and interpretable solution for many real-world problems.

23
Q

What are some pre and post-processing that could be done when applying k-means?

A

• Pre-processing
–Normalize the data
–Eliminate outliers

• Post-processing
–Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high SSE
– Merge clusters that are ‘close’ and that have relatively low SSE
–These steps can be used during the clustering process

24
Q

Give an overview of k-means strengths, weaknesses, advantages, and disadvantages

A

• Strength
– Relatively efficient
– Often terminates at a local optimum
– The global optimum may be found using techniques such as deterministic annealing and genetic algorithms

• Weakness
– Applicable only when mean is defined, then what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes

• Advantages
– Simple, understandable
– Items automatically assigned to clusters

• Disadvantages
– Must pick the number of clusters beforehand
– All items are forced into a cluster
– Too sensitive to outliers

25
Q

How could we use the knee/elbow analysis to choose the number of clusters “k”?

A
26
Q

K-means has a random initialization. This means that every time I run it I might obtain a slightly different result. How could we solve it?

A

Run each configuration (each value of k) more than once and take the best or the average result, …
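
For example, with scikit-learn, n_init reruns k-means with different random initializations and keeps the run with the lowest within-cluster SSE, while random_state makes the result reproducible (toy data shown):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)  # best of 10 initializations
print(km.inertia_)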

27
Q

Define the mean shift clustering. How could we start the algorithm? Give a pseudo code for it.

A

Mean shift is an iterative, nonparametric, and versatile algorithm
It searches for the mode (i.e., the point of highest density) of a data distribution

Pseudo code:
1. Choose a search window size (the bandwidth)
2. Choose the initial location of the search window
3. Compute the mean location in the search window
4. Center the search window at the location computed in Step 3
5. Repeat Steps 3 and 4 until convergence
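
A direct translation of this pseudocode into a small NumPy sketch that seeks the mode from one starting window (a flat kernel is assumed; in practice the search is run from every point and nearby modes are merged into clusters):

import numpy as np

def mean_shift_mode(X, start, bandwidth, max_iter=300, tol=1e-6):
    center = np.asarray(start, dtype=float)            # Step 2: initial window location
    for _ in range(max_iter):
        in_window = np.linalg.norm(X - center, axis=1) <= bandwidth
        new_center = X[in_window].mean(axis=0)         # Step 3: mean inside the window
        if np.linalg.norm(new_center - center) < tol:  # Step 5: stop at convergence
            return new_center
        center = new_center                            # Step 4: re-center the window
    return center

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 0.5, (100, 2))])
modes = np.array([mean_shift_mode(X, x, bandwidth=1.0) for x in X])
# points whose modes (nearly) coincide belong to the same cluster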

28
Q

Why don’t we always use mean shift instead of k-means? Compare them

A

• Number of clusters and shape
– k-means assumes that the number of clusters is already known and the clusters are shaped spherically
– Mean shift does not assume anything about number of clusters.
– Instead, the number of modes give the number of clusters
– Also, since mean shift is based on density estimation, it can handle arbitrarily shaped clusters.

• Initialization and outliers
– k-means is sensitive to initialization. Mean shift is robust to initializations.
– Typically, mean shift is run for each point or sometimes points are selected uniformly from the feature space
– Similarly, k-means is sensitive to outliers while Mean Shift is less sensitive.

• Time complexity

– k-means is fast and has a time complexity of O(knT), where k is the number of clusters, n is the number of points, and T is the number of iterations.

– Classic mean shift is computationally expensive, with a time complexity of O(Tn²).


29
Q

Give an overview of mean shift clustering strengths and weaknesses

A

• Strength
–Does not assume any prior shape on data clusters
–Can handle arbitrary feature spaces
–Only the window size h to choose
–The window size h (or bandwidth) has a physical meaning

• Weaknesses
– The window size h (or bandwidth) selection is not trivial. An inappropriate window size can cause modes to be merged or generate additional “shallow” modes
–Adaptive window size can help
–Not suitable for high-dimensional features

30
Q

Define the Expectation-Maximization clustering

A

Expectation-Maximization (EM) Clustering

Expectation-Maximization (EM) clustering is a probabilistic clustering method used to group data points into clusters by estimating the parameters of an underlying probability distribution (usually a mixture of Gaussian distributions). It is commonly applied in Gaussian Mixture Models (GMMs) to model the data as a weighted sum of multiple Gaussian distributions.

How EM Clustering Works

The EM algorithm alternates between two main steps—Expectation (E) and Maximization (M)—to iteratively optimize the likelihood of the data given the current cluster parameters.
1. Initialization:
• Start with an initial guess for the parameters of the mixture model:
• Cluster means ( \mu_k ).
• Covariance matrices ( \Sigma_k ) for each cluster.
• Mixing coefficients ( \pi_k ), which represent the proportion of each cluster.
2. E-Step (Expectation Step):
• Calculate the responsibility r_{ik} , which is the probability that data point x_i belongs to cluster k , given the current parameters:

r_{ik} = \frac{\pi_k \cdot \mathcal{N}(x_i | \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \cdot \mathcal{N}(x_i | \mu_j, \Sigma_j)}

Where:
• \mathcal{N}(x_i | \mu_k, \Sigma_k) : Gaussian probability density function for cluster k .
3. M-Step (Maximization Step):
• Update the parameters of the model ( \mu_k, \Sigma_k, \pi_k ) based on the responsibilities computed in the E-step:
• Update the mean:

\mu_k = \frac{\sum_{i=1}^n r_{ik} x_i}{\sum_{i=1}^n r_{ik}}

•	Update the covariance matrix:

\Sigma_k = \frac{\sum_{i=1}^n r_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^n r_{ik}}

•	Update the mixing coefficient:

\pi_k = \frac{\sum_{i=1}^n r_{ik}}{n}

4.	Repeat:
•	Alternate between the E-step and M-step until convergence (i.e., the change in the likelihood function or parameters is below a threshold).
5.	Output:
•	The final model parameters ( \mu_k, \Sigma_k, \pi_k ) and the soft assignments (responsibilities  r_{ik} ) for each data point.
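
In practice, Gaussian-mixture EM is available in scikit-learn; a minimal sketch on hypothetical toy data (the E and M steps run inside fit):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 1.5]], 200),    # elongated blob
    rng.multivariate_normal([5, 5], [[1.0, -0.7], [-0.7, 1.0]], 200),  # another elongated blob
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
resp = gmm.predict_proba(X)      # responsibilities r_ik (soft assignments)
labels = gmm.predict(X)          # hard labels via the largest responsibility
print(gmm.means_, gmm.weights_)  # estimated mu_k and mixing coefficients pi_k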

Key Features of EM Clustering

1.	Soft Clustering:
•	Unlike k-means, EM assigns probabilities to points for each cluster, allowing for soft clustering instead of hard assignments.
2.	Probabilistic Model:
•	Assumes data is generated from a mixture of probability distributions (commonly Gaussian).
3.	Optimization:
•	Maximizes the likelihood of the observed data under the given model.
4.	Iterative Nature:
•	Alternates between assigning probabilities (E-step) and refining the parameters (M-step).

Advantages of EM Clustering

1.	Soft Assignments:
•	Allows for overlapping clusters by assigning probabilities to data points, rather than forcing hard cluster boundaries.
2.	Flexible Models:
•	Works well with data that can be modeled by mixtures of distributions (e.g., Gaussian Mixture Models).
3.	Theoretical Foundation:
•	Based on the principle of maximum likelihood estimation, providing a rigorous statistical framework.
4.	Captures Cluster Shapes:
•	Handles clusters with different shapes, sizes, and orientations when using Gaussian distributions.
5.	Customizable:
•	Can work with non-Gaussian distributions, depending on the application.

Limitations of EM Clustering

1.	Initialization Sensitivity:
•	Like k-means, the EM algorithm is sensitive to the initial parameters and can converge to local optima.
2.	Assumption of Distribution:
•	Assumes the data comes from a specific distribution (e.g., Gaussian), which may not always hold in practice.
3.	Computational Complexity:
•	Computationally expensive for large datasets, especially with high-dimensional data.
4.	Overfitting:
•	Can overfit if the number of clusters or the model complexity is too high.
5.	Number of Clusters ( k ) Required:
•	Requires  k , the number of clusters, to be specified in advance.

Applications of EM Clustering

1.	Image Segmentation:
•	Clustering pixels based on color, intensity, or texture.
2.	Document Clustering:
•	Grouping documents based on probabilistic models of word frequencies.
3.	Anomaly Detection:
•	Identifying outliers in datasets by assigning low probabilities.
4.	Bioinformatics:
•	Identifying gene expression patterns or clustering protein sequences.
5.	Customer Segmentation:
•	Segmenting customers based on purchasing behavior and demographic data.

Comparison with k-Means

• Clustering Type: EM is probabilistic (soft clustering); k-means is hard clustering (each point belongs to one cluster).
• Cluster Shape: EM can handle elliptical or non-spherical clusters; k-means assumes spherical clusters.
• Model: EM is based on statistical mixture models; k-means is based on minimizing intra-cluster variance.
• Output: EM gives probabilities for each cluster; k-means gives hard cluster assignments.
• Flexibility: EM works with various probability distributions; k-means is limited to distance-based clustering.

Example

Suppose you have data points distributed in an elliptical pattern, and you suspect they come from two Gaussian distributions. EM clustering will:
1. Initialize the means, covariances, and weights of the two Gaussians.
2. Iteratively adjust these parameters to maximize the likelihood of the observed data.
3. Output two Gaussian components and probabilities for each data point belonging to each component.

The resulting clusters will better capture the elliptical shape than k-means, which would likely split the clusters poorly.

Summary

EM clustering is a powerful algorithm for probabilistic clustering, particularly effective when data is generated by mixtures of distributions. Its ability to handle soft assignments and non-spherical clusters makes it a versatile alternative to distance-based methods like k-means, albeit with added computational complexity and sensitivity to initialization.

31
Q

What is the difference between EM and k-means?

A

The Expectation-Maximization (EM) algorithm and k-means are two clustering methods that differ fundamentally in their approach, assumptions, and outputs. Here’s a breakdown:
1. Clustering Type:
• EM performs soft clustering, meaning each data point is assigned a probability of belonging to each cluster.
• K-means uses hard clustering, where each point is assigned exclusively to one cluster.
2. Underlying Model:
• EM is a probabilistic method that models the data using statistical distributions, often assuming that the data is generated from a mixture of distributions (e.g., Gaussian Mixture Models).
• K-means is centroid-based, where clusters are represented by their centroids, and the algorithm minimizes the variance (distance) within clusters.
3. Cluster Shape:
• EM can handle clusters with complex shapes (e.g., elliptical or non-spherical) because it accounts for the covariance of the data.
• K-means assumes clusters are spherical and evenly distributed, which can limit its ability to handle irregular or overlapping clusters.
4. Assignment:
• EM calculates the probability (or responsibility) of each point belonging to each cluster. This allows for more flexibility in handling ambiguous points.
• K-means assigns each point to the nearest cluster based on a distance metric, typically Euclidean distance.
5. Output:
• EM produces probabilities for cluster membership, giving insight into the degree of association of points with multiple clusters.
• K-means outputs hard assignments, where each point belongs to only one cluster.
6. Flexibility:
• EM is more flexible since it can be applied to data that follows non-Gaussian distributions, depending on the choice of the underlying probability model.
• K-means is less flexible as it is tied to distance metrics and assumes clusters are compact and spherical.
7. Computational Complexity:
• EM is computationally more expensive due to the iterative estimation of probabilities and parameter updates.
• K-means is simpler and faster, making it more suitable for large datasets.
8. Sensitivity to Initialization:
• Both methods are sensitive to initialization, but EM mitigates this by iteratively refining probabilities based on a likelihood function, whereas k-means relies entirely on initial centroid placement.

In summary, EM is a probabilistic, flexible clustering method better suited for complex cluster shapes and overlapping data, while k-means is a simpler, faster approach effective for well-separated, spherical clusters.

32
Q

What is the difference between density-based clustering and representative-based clustering? Define it

A

Representative-based clustering methods are suitable for finding ellipsoid-shaped clusters (or convex clusters at best)
For non-convex clusters, these methods have trouble finding the true clusters
Density-based methods can mine such non-convex clusters

Definition:
• Clustering based on density (local cluster criterion), such as density-connected points
• Major features:
–Discover clusters of arbitrary shape
–Handle noise
–One scan
–Need density parameters as termination condition

33
Q

What does the concepts of directly density reachable, density reachable, density connected, and density-based cluster mean?

A

• Directly density reachable
–An object x is directly density-reachable from object y if x is within the ε-neighborhood of y and y is a core object

• Density Reachable
– An object x is density-reachable from object y if there is a chain of objects x1, …, xn with x1 = y and xn = x such that each xi+1 is directly density-reachable from xi

• Density Connected
–An object p is density-connected to q with respect to ε and MinPts if there is an object o such that both p and q are density reachable from o

• Density-Based Cluster
–A density-based cluster is defined as a maximal set of density connected points.
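
These definitions are what DBSCAN implements; a minimal scikit-learn sketch on non-convex toy data, where eps plays the role of ε, min_samples the role of MinPts, and core_sample_indices_ exposes the core objects:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)  # two interleaved half-moons

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                          # -1 marks noise points
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True    # core (dense) objects
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int((labels == -1).sum()))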

34
Q

What is the complexity of DBScan?

A

• DBSCAN needs to compute the ε-neighborhood for each point
• If the dimensionality is not too high, this can be done efficiently using a spatial index structure in O(n log n)
• If the dimensionality is high, it takes O(n²) to compute the neighborhood for each point
• Once the ε-neighborhood has been computed, the algorithm needs only a single pass over all the points to find the density-connected clusters
• The overall complexity of DBSCAN is thus between O(n log n) and O(n²) in the worst case

35
Q

When is DBScan not suitable?

A

DBScan fails when applied to data of varying density

36
Q

Define the HDBScan. What is the difference between DBScan and HDBScan?

A

HDBSCAN converts DBSCAN into a hierarchical clustering algorithm by building a hierarchy over varying density levels
It then applies a technique to extract a flat clustering based on the stability of clusters, which makes it better suited than DBSCAN for data of varying density
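
A minimal usage sketch, assuming scikit-learn >= 1.3 (which ships sklearn.cluster.HDBSCAN); the standalone hdbscan package offers a similar interface:

from sklearn.cluster import HDBSCAN
from sklearn.datasets import make_blobs

# blobs with different densities, where a single DBSCAN eps tends to struggle
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[0.4, 1.5, 0.3], random_state=0)

hdb = HDBSCAN(min_cluster_size=15).fit(X)
print(set(hdb.labels_))   # -1 marks noise; the rest are the extracted flat clusters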