5.2 Clustering Flashcards
Explain the process of the k-means clustering algorithm. What is its main objective?
The k-means clustering algorithm follows these steps:
- Choose k, the number of clusters.
- Randomly assign each observation to one of the clusters.
- Calculate each cluster’s centroid.
- Reassign each observation to the closest centroid.
- Repeat the centroid calculation and reassignment steps until the cluster assignments stop changing.
The main objective is to minimize the total within-cluster variation (sum of squares).
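For concreteness, here is a minimal NumPy sketch that mirrors these steps (random initial assignment, centroid update, reassignment). The toy data and the choice of k are made up for illustration, and edge cases such as empty clusters are not handled.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means mirroring the steps above (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    # Randomly assign each observation to one of the k clusters
    labels = rng.integers(0, k, size=X.shape[0])
    for _ in range(max_iter):
        # Calculate each cluster's centroid (the mean of its observations);
        # empty clusters are not handled in this sketch
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Reassign each observation to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once the cluster assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Toy data: two loose groups in two dimensions
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
```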
What is the purpose of the elbow method in k-means clustering, and how is it implemented?
The elbow method helps determine the optimal number of clusters (k) in k-means clustering. It examines the proportion of variance explained as each new cluster is added. It is implemented by:
- Performing k-means for a range of k values.
- Calculating the proportion of variance explained for each k, defined as the ratio of the between-cluster sum of squares to the total sum of squares. The total sum of squares is fixed and equals the total within-cluster sum of squares when k = 1; the between-cluster sum of squares is the total sum of squares minus the total within-cluster sum of squares. As k increases, the total within-cluster sum of squares decreases while the between-cluster sum of squares increases.
- Plotting proportion of variance explained vs k.
- Choosing k at the “elbow” where the rate of increase in explained variance levels off.
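A short scikit-learn sketch of this procedure follows; the toy data are made up, and KMeans's inertia_ attribute is used as the total within-cluster sum of squares.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Toy data with three loose groups, purely for illustration
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 1.0, (60, 2)) for loc in (0, 5, 10)])

# Total sum of squares = total within-cluster sum of squares when k = 1
tss = ((X - X.mean(axis=0)) ** 2).sum()

ks = range(1, 11)
pve = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss = km.inertia_               # total within-cluster sum of squares
    pve.append((tss - wss) / tss)   # between-cluster SS / total SS

plt.plot(list(ks), pve, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Proportion of variance explained")
plt.show()
```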
Describe the agglomerative hierarchical clustering algorithm. What is its main objective?
The agglomerative hierarchical clustering algorithm follows these steps:
- Start with each observation as its own cluster.
- Calculate the pairwise inter-cluster dissimilarity between all clusters.
- Fuse the two clusters with the lowest dissimilarity.
- Repeat the dissimilarity calculation and fusing steps until all observations are in one cluster.
The main objective is to create a hierarchical representation of clusters called a dendrogram.
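A minimal SciPy sketch of the same procedure, assuming toy data and Euclidean dissimilarity:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # toy data

# Pairwise Euclidean dissimilarities between all observations
d = pdist(X, metric="euclidean")

# Repeatedly fuse the two least dissimilar clusters until one cluster remains;
# Z records the full merge history, i.e. the dendrogram
Z = linkage(d, method="complete")
```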
Compare and contrast the four common linkages used in hierarchical clustering. Which ones are generally preferred and why?
The linkage simply defines how we measure the dissimilarity between clusters.
Four common linkages:
- Complete: Largest pairwise dissimilarity
- Single: Smallest pairwise dissimilarity
- Average: Average pairwise dissimilarity
- Centroid: Dissimilarity between centroids
Complete and average linkages are often preferred as they yield more balanced dendrograms. Single linkage tends to produce skewed dendrograms, and centroid linkage can lead to inversions.
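These four linkages correspond to the method argument of SciPy's linkage function; a small sketch on toy data for comparing the resulting dendrogram shapes:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(15, 2))   # toy data
d = pdist(X)                                        # Euclidean by default

# One merge history per linkage; centroid linkage requires Euclidean distances
merges = {m: linkage(d, method=m)
          for m in ("complete", "single", "average", "centroid")}
```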
What is a dendrogram, and how is it interpreted in the context of hierarchical clustering?
A dendrogram is a tree-like representation of hierarchical clustering results. It shows observations as leaves at the bottom of the diagram. The height of the joins represents the inter-cluster dissimilarity. Clusters are formed at different levels of dissimilarity, based on where the joins occur.
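For example, the dendrogram can be drawn from the merge history with SciPy (toy data assumed):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.default_rng(0).normal(size=(20, 2))   # toy data
Z = linkage(X, method="complete")                   # raw observations -> Euclidean distance

# Leaves along the bottom are observations; the height of each join is the
# inter-cluster dissimilarity at which the two clusters were fused
dendrogram(Z)
plt.ylabel("Inter-cluster dissimilarity")
plt.show()
```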
How does the choice of dissimilarity measure (Euclidean distance vs. correlation-based distance) affect clustering results?
Euclidean distance focuses on numerical closeness of values, while correlation-based distance looks at patterns across variables. This can lead to different clustering results, especially when observations have similar patterns but different magnitudes.
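A toy example of the difference, using SciPy's pdist (correlation distance is 1 minus the Pearson correlation):

```python
import numpy as np
from scipy.spatial.distance import pdist

# Rows 1 and 2 have the same pattern but very different magnitudes;
# row 3 has the opposite pattern (toy data)
X = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0],
              [3.0, 2.0, 1.0]])

print(pdist(X, metric="euclidean"))    # rows 1 and 2 look far apart
print(pdist(X, metric="correlation"))  # rows 1 and 2 have distance 0
```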
Explain the concept of inversions in hierarchical clustering. Which linkage is prone to this issue?
Inversions occur when clusters join at a height lower than either individual cluster, complicating dendrogram interpretation. Centroid linkage is prone to this issue.
Explain how you would determine the final cluster assignments in hierarchical clustering after creating the dendrogram.
To determine final cluster assignments in hierarchical clustering, choose a height at which to cut the dendrogram. Each observation is then assigned to the cluster formed by the branch it falls under below the cut, so the number of clusters equals the number of branches the cut intersects.
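With SciPy, the cut can be made either at a chosen height or for a desired number of clusters (toy data assumed):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))   # toy data
Z = linkage(X, method="complete")

# Cut at a chosen height: every branch below the cut becomes one cluster
labels_by_height = fcluster(Z, t=2.5, criterion="distance")

# Or ask for a fixed number of clusters and let the cut height follow
labels_by_count = fcluster(Z, t=3, criterion="maxclust")
```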
How does k-means clustering differ from hierarchical clustering in terms of specifying the number of clusters?
k-means requires specifying the number of clusters before running the algorithm. Hierarchical clustering does not require this; the number of clusters is determined after creating the dendrogram by choosing where to cut it.
Why is standardization of variables important in both k-means and hierarchical clustering?
Standardization is important because clustering algorithms rely heavily on distance measurements. Without standardization, variables with larger scales can disproportionately influence clustering outcomes. Standardization ensures all variables contribute equally to distance calculations.
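A small scikit-learn sketch of the effect, with made-up features on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy data: the second feature's scale dwarfs the first's
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 100),
                     rng.normal(50_000, 10_000, 100)])

# Standardize to mean 0 and standard deviation 1 per variable
X_std = StandardScaler().fit_transform(X)

# Without standardization the second feature dominates the distance calculations
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
labels_std = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
```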
How does the curse of dimensionality affect clustering algorithms?
The curse of dimensionality affects clustering in several ways. It makes visualization difficult beyond three dimensions. It reduces the ability to discriminate distances between close and far observations in high-dimensional spaces. Additionally, it requires more observations to maintain the same level of information as the number of dimensions increases.
Compare k-means and hierarchical clustering in terms of their sensitivity to initial conditions and reproducibility of results.
k-means is sensitive to its initialization because of the random initial assignments, which can lead to different results across runs. In contrast, hierarchical clustering is deterministic: for a given dataset, dissimilarity measure, and linkage, it produces the same result every time.
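In practice this is handled by fixing a seed and/or running several initializations and keeping the best fit. A scikit-learn sketch (note that scikit-learn's KMeans randomly initializes centroids rather than assignments, but the reproducibility point is the same):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))   # toy data

# Different seeds with a single initialization can land in different local optima
km_a = KMeans(n_clusters=3, n_init=1, random_state=1).fit(X)
km_b = KMeans(n_clusters=3, n_init=1, random_state=2).fit(X)

# Fix the seed and run several initializations, keeping the best, for reproducibility
km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
```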
How can the results of k-means clustering be visualized? What about hierarchical clustering?
Both k-means clustering and hierarchical clustering results can be visualized using scatterplots of the clustered features, with data points colored by cluster assignment. Additionally, hierarchical clustering results are typically visualized using a dendrogram.
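A brief matplotlib sketch of both visualizations on toy two-dimensional data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])   # toy data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Scatterplot of the clustered features, colored by k-means assignment
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
ax1.scatter(X[:, 0], X[:, 1], c=labels)

# Dendrogram for a hierarchical clustering of the same data
dendrogram(linkage(X, method="average"), ax=ax2, no_labels=True)
plt.show()
```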