K-Means Clustering Flashcards
- What is unsupervised learning and how does it differ from supervised learning?
Unsupervised learning involves analyzing data (xn) without given labels (tn). Unlike supervised learning, which uses known labels to learn mappings, unsupervised learning focuses on discovering inherent structure in data, such as clustering customers based on their purchase histories.
- In the context of customer purchase data, how can clustering be used?
Each customer can be represented as a binary vector indicating which products they bought. Clustering groups customers who exhibit similar purchasing patterns, helping to identify customer segments or products often bought together.
- What is K-means clustering and what is its main objective?
K-means clustering is an unsupervised algorithm that partitions data into K clusters. Its main objective is to minimize the total within-cluster variance, or the sum of squared distances between each point and its assigned cluster centroid.
- How is each cluster defined in K-means clustering?
Each cluster is represented by a centroid µk (a point in the input space, e.g., µk = [µk1, µk2]^T). Data points are assigned to the cluster with the closest centroid, typically measured by Euclidean distance.
- Describe the iterative algorithm for K-means clustering in detail.
The K-means algorithm proceeds as follows: (1) Initialize K cluster centers (µ1, µ2, …, µK) randomly; (2) For each data point xn, assign it to the closest centroid by setting znk = 1 if xn is in cluster k (and 0 otherwise); (3) Update each centroid µk to be the average of all points assigned to that cluster, i.e., µk = (Σn znk xn)/(Σn znk); (4) Repeat steps 2 and 3 until the assignments do not change.
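As a concrete illustration, here is a minimal NumPy sketch of these four steps (the function and variable names are mine, not prescribed by the lecture):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Plain K-means (Lloyd's algorithm) on an (N, D) data matrix X."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as K distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assignments = None
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # Step 4: stop as soon as the assignments no longer change.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its assigned points.
        for k in range(K):
            members = X[assignments == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids, assignments
```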
- What is the cost function that K-means minimizes and what are the constraints on the assignment variables?
K-means minimizes the cost function J = Σₙ Σₖ znk ||xn - µk||² subject to znk ∈ {0,1} and Σₖ znk = 1 for each n, meaning each data point is assigned to exactly one cluster.
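A small sketch of how J and the one-hot constraint look in code, following the conventions of the K-means sketch above (names are illustrative):

```python
import numpy as np

def kmeans_cost(X, centroids, assignments):
    """J = sum_n sum_k z_nk ||x_n - mu_k||^2 with one-hot z_nk."""
    N, K = len(X), len(centroids)
    Z = np.zeros((N, K))
    Z[np.arange(N), assignments] = 1    # z_nk in {0, 1}
    assert np.all(Z.sum(axis=1) == 1)   # each point in exactly one cluster
    sq = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) ** 2
    return float((Z * sq).sum())
```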
- Why is there no analytical solution for the centroids µk in K-means?
Because the assignment of data points to clusters is discrete and depends on the centroids themselves, the problem becomes non-convex. Therefore, an iterative algorithm is used to approximate a local minimum.
- What are two major issues that affect the performance of K-means clustering?
The two main issues are: (1) Choosing the number of clusters K; (2) Initializing the cluster centers, as poor initialization can lead to suboptimal solutions or convergence to a bad local minimum.
- How does the K-means++ algorithm improve cluster initialization?
K-means++ improves initialization by selecting the first center uniformly at random, then choosing each subsequent center from the data points with probability proportional to its squared distance from the nearest already-chosen center. This spreads out the initial centers and yields a solution whose expected cost is within an O(log K) factor of the optimal clustering cost.
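A minimal sketch of this D²-sampling step (assuming NumPy; names are mine):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-means++ seeding: spread out initial centers via D^2 sampling."""
    rng = np.random.default_rng(seed)
    # First center: a data point chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Squared distance from every point to its nearest chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Next center: sampled with probability proportional to d2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```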
- What is the Elbow heuristic and how is it used to choose the number of clusters K?
The Elbow heuristic involves plotting the total within-cluster sum of squares as a function of K and looking for a point (the ‘elbow’) where the rate of decrease sharply changes. This point indicates that adding more clusters beyond it yields diminishing returns.
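A quick sketch of an elbow plot on toy data, reusing the hypothetical `kmeans` and `kmeans_cost` helpers from the earlier cards:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # toy data; replace with your own

# Total within-cluster sum of squares for each candidate K.
ks = list(range(1, 11))
costs = [kmeans_cost(X, *kmeans(X, k)) for k in ks]

plt.plot(ks, costs, marker="o")
plt.xlabel("K")
plt.ylabel("within-cluster sum of squares")
plt.title("Elbow plot: look for the bend")
plt.show()
```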
- What is the Sum of Norms (SON) formulation in clustering and what extremes does it represent?
The SON formulation minimizes an objective of the form Σₙ ||xn - µn||² + λ Σ_{p<q} ||µp - µq||², where each data point xn has its own centroid µn, and points whose centroids coincide belong to the same cluster. If only the first term is used (λ = 0), every point keeps its own centroid and becomes its own cluster (K = N); if only the second term matters (λ very large), all centroids collapse to one point (K = 1). Varying λ steers the solution between these extremes.
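For illustration only, here is a gradient-descent sketch of the squared-penalty objective exactly as written above. Note that with squared penalties the centroids shrink toward one another smoothly rather than fusing exactly (the classic sum-of-norms penalty uses an unsquared norm), so this sketch merges centroids that end within a tolerance; λ, the step size, and the tolerance are assumptions that need tuning per dataset:

```python
import numpy as np

def son_cluster(X, lam, steps=2000, lr=1e-3, tol=1e-3):
    """Gradient descent on sum_n ||x_n - mu_n||^2 + lam * sum_{p<q} ||mu_p - mu_q||^2."""
    X = X.astype(float)
    mu = X.copy()            # start each centroid mu_n at its own data point
    N = len(X)
    for _ in range(steps):
        grad = 2 * (mu - X)  # gradient of the data-fit term
        # Penalty gradient for mu_p: 2*lam*sum_{q != p}(mu_p - mu_q)
        #                          = 2*lam*(N*mu_p - sum_q mu_q).
        grad += 2 * lam * (N * mu - mu.sum(axis=0))
        mu -= lr * grad
    # Centroids that end up within `tol` of each other form one cluster.
    labels = -np.ones(N, dtype=int)
    for i in range(N):
        if labels[i] == -1:
            close = np.linalg.norm(mu - mu[i], axis=1) < tol
            labels[(labels == -1) & close] = labels.max() + 1
    return mu, labels
```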
- When might K-means fail even if the data have clear cluster structure?
K-means may fail when clusters are non-spherical, elongated, or when one cluster cannot be adequately represented by a single point. In such cases, the assumption of a single centroid per cluster does not capture the true structure of the data.
- What is kernel K-means and why might one use it?
Kernel K-means is an extension of the standard K-means algorithm that uses kernel functions to implicitly map data into a high-dimensional feature space. This allows it to capture non-linear cluster structures by replacing the Euclidean distance with a kernel-induced distance.
- How is the squared distance between a point and a cluster center expressed in kernel K-means?
The squared distance is written as k(xn, xn) - (2/Nk) Σₘ zmk k(xn, xm) + (1/Nk²) Σₘ Σₗ zmk zlk k(xm, xl), where k(x, x') is the kernel function and Nk is the number of points in cluster k.
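In matrix form, this distance can be computed for all points and clusters at once from the Gram matrix; a sketch (names mine, clusters assumed non-empty):

```python
import numpy as np

def kernel_distances(Kmat, Z):
    """Squared feature-space distance from every point to every cluster mean.

    Kmat: (N, N) Gram matrix, Kmat[n, m] = k(x_n, x_m).
    Z:    (N, K) one-hot assignment matrix (every cluster assumed non-empty).
    """
    Nk = Z.sum(axis=0)                       # cluster sizes N_k, shape (K,)
    term1 = np.diag(Kmat)[:, None]           # k(x_n, x_n)
    term2 = -2 * (Kmat @ Z) / Nk             # -(2/N_k) sum_m z_mk k(x_n, x_m)
    term3 = np.diag(Z.T @ Kmat @ Z) / Nk**2  # (1/N_k^2) sum_{m,l} z_mk z_lk k(x_m, x_l)
    return term1 + term2 + term3[None, :]
```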
- Why is it beneficial to use the kernel trick in clustering?
The kernel trick allows the algorithm to compute distances in an implicit high-dimensional space without explicitly mapping the data. This enables the capture of complex, non-linear structures in the data while keeping computations efficient.
- How do mixture models differ from K-means clustering in terms of their approach to modeling data?
Mixture models assume that data points are generated by a mixture of underlying probability distributions (e.g., Gaussian mixtures) and assign soft, probabilistic cluster memberships. In contrast, K-means makes hard assignments based solely on minimizing Euclidean distances.
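To make the soft-assignment idea concrete, here is a minimal sketch of mixture responsibilities for spherical Gaussian components (the spherical simplification and all names are mine):

```python
import numpy as np

def responsibilities(X, means, sigma2, weights):
    """Soft assignments gamma[n, k] proportional to pi_k * N(x_n | mu_k, sigma2_k * I)."""
    N, D = X.shape
    gamma = np.zeros((N, len(means)))
    for k, (mu, s2, pi) in enumerate(zip(means, sigma2, weights)):
        sq = np.sum((X - mu) ** 2, axis=1)
        gamma[:, k] = pi * np.exp(-0.5 * sq / s2) / (2 * np.pi * s2) ** (D / 2)
    # Normalize each row so memberships sum to 1 (the soft assignment).
    return gamma / gamma.sum(axis=1, keepdims=True)
```

As the component variances shrink toward zero, these soft responsibilities approach the hard nearest-centroid assignments of K-means.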
- What is one advantage of mixture models over K-means clustering?
Mixture models provide soft assignments, meaning they estimate the probability that each data point belongs to each cluster. This probabilistic framework can capture uncertainty and overlapping clusters better than the hard assignments of K-means.
- In unsupervised learning, why is clustering considered a fundamental task?
Clustering is fundamental because it helps reveal the underlying structure and natural groupings within data without the need for labels. This insight can be used for data exploration, segmentation, and as a preprocessing step for other machine learning tasks.
- How can K-means clustering be applied to group products that are frequently bought together?
Each product can be represented by features such as purchase frequency or co-occurrence with other products. K-means can then cluster products based on these features, grouping those that are commonly purchased together.
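One possible featurization (illustrative, not prescribed by the lecture): represent each product by its co-purchase counts, normalized by its own purchase count, and cluster those rows with the `kmeans` sketch from the earlier card:

```python
import numpy as np

# Toy purchase matrix: rows = customers, columns = products (1 = bought).
purchases = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])
# Represent each product by how often it co-occurs with every product,
# normalized by its own purchase count.
cooc = purchases.T @ purchases
features = cooc / np.maximum(np.diag(cooc), 1)[:, None]
centroids, product_clusters = kmeans(features, K=2)  # groups similar products
```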
- Summarize the key steps of the K-means clustering algorithm.
The algorithm involves: 1) Randomly initializing K centroids; 2) Assigning each data point to the nearest centroid; 3) Updating each centroid as the mean of the points assigned to it; 4) Repeating steps 2 and 3 until the cluster assignments no longer change.
- What is the objective function of K-means clustering and why is it minimized?
The objective function is J = Σₙ Σₖ znk ||xn - µk||², which represents the total within-cluster variance. Minimizing this function leads to compact, well-separated clusters.
- How does K-means ensure that each data point is assigned to exactly one cluster?
The assignment variable znk is defined such that znk ∈ {0, 1} and for each data point n, Σₖ znk = 1. This constraint ensures that every point is assigned to one and only one cluster.
- What challenges might arise when choosing the initial cluster centers in K-means, and how does this affect the algorithm?
Poor initialization can cause K-means to converge to a suboptimal local minimum, leading to ineffective clustering. The outcome can be highly sensitive to the initial centers, making proper initialization (e.g., via K-means++) critical for good performance.
- Explain the concept of parameter sharing in the context of clustering algorithms. Is this concept directly applicable to K-means?
In neural networks, parameter sharing means reusing the same parameters across different parts of the input (e.g., a convolutional filter applied at every spatial location). While K-means does not share parameters in that sense, it uses a single distance metric and a single update rule for all clusters, which standardizes the clustering process across the dataset.
- How can cross-validation be used in the context of K-means clustering, despite it being unsupervised?
Although K-means is unsupervised, one can use techniques like the Gap Statistic or evaluate clustering quality (e.g., silhouette score) on hold-out data to assess the stability and optimal number of clusters. These methods help in selecting the best value of K.
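A quick sketch of scoring candidate values of K with scikit-learn's `silhouette_score`, using the `kmeans` helper from the earlier card and toy data:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # toy data; replace with your own

# Higher silhouette scores indicate more compact, better-separated clusters.
for k in range(2, 7):
    _, labels = kmeans(X, k)
    print(k, silhouette_score(X, labels))
```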
- Describe the kernel K-means algorithm in brief steps.
Kernel K-means involves: 1) Selecting a kernel function k(x, x') and its parameters; 2) Initializing random cluster assignments znk; 3) For each data point, computing the kernel-induced distance to each cluster via k(xn, xn) - (2/Nk) Σₘ zmk k(xn, xm) + (1/Nk²) Σₘ Σₗ zmk zlk k(xm, xl); 4) Reassigning points to the cluster with the smallest distance; 5) Repeating until assignments converge.
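Putting these steps together, a sketch with a Gaussian kernel k(x, x') = exp(-β ||x - x'||²), reusing the `kernel_distances` helper from the earlier card (assumes no cluster empties out during the iterations):

```python
import numpy as np

def kernel_kmeans(X, K, beta=1.0, max_iters=100, seed=0):
    """Kernel K-means with Gaussian kernel k(x, x') = exp(-beta * ||x - x'||^2)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    # Step 1: Gram matrix for the chosen kernel and parameter beta.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    Kmat = np.exp(-beta * sq)
    # Step 2: random initial assignments z_nk.
    labels = rng.integers(K, size=N)
    for _ in range(max_iters):
        Z = np.zeros((N, K))
        Z[np.arange(N), labels] = 1
        # Steps 3-4: kernel-induced distances, then reassign each point.
        new_labels = kernel_distances(Kmat, Z).argmin(axis=1)
        # Step 5: stop once the assignments converge.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```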
- What additional parameters must be chosen when using kernel K-means compared to standard K-means?
When using kernel K-means, one must choose the kernel function (e.g., Gaussian, polynomial) and set its associated parameters (such as the bandwidth parameter β in a Gaussian kernel).
- What is the key benefit of kernel K-means over standard K-means?
Kernel K-means can capture non-linear cluster structures by implicitly mapping the data into a higher-dimensional space using the kernel trick, allowing it to form clusters that are not linearly separable in the original input space.
- What is a potential drawback of K-means clustering regarding its sensitivity to the choice of K and initialization?
K-means is sensitive to the number of clusters K chosen and the initial placement of centroids. A poor choice can lead to suboptimal clusters, and the algorithm may converge to a local minimum rather than the global optimum.
- How does the SON (Sum of Norms) formulation help in clustering without having to pre-specify K?
The SON formulation minimizes an objective that includes a term for the distance of points from their assigned centers and a penalty term for differences between centers. By varying the regularization parameter λ, one can steer the solution between K = N (each point is its own cluster) and K = 1 (all points in one cluster), effectively allowing the model to decide the number of clusters automatically.
- What happens in the SON formulation when only the first term is considered?
If only the first term is considered (i.e., λ = 0), then each data point is perfectly represented by its own center, resulting in K = N clusters.
- Conversely, what is the outcome in the SON formulation if only the second term is considered?
If only the second term is considered (i.e., λ is extremely large), all cluster centers collapse to a single point, resulting in K = 1 cluster.
- How can varying the parameter λ in the SON formulation affect the clustering result?
By adjusting λ, one can control the trade-off between fitting each point perfectly (many clusters) and forcing centers to be similar (fewer clusters). This allows a smooth transition between having many clusters (K = N) and a single cluster (K = 1), thereby indirectly choosing an appropriate number of clusters.
- Summarize the main ideas behind K-means clustering.
K-means is an unsupervised algorithm that partitions data into K clusters by iteratively assigning data points to the nearest centroid and updating centroids as the mean of assigned points, all while minimizing the sum of squared distances within clusters.
- Summarize the advantages and challenges of K-means clustering.
Advantages include simplicity, efficiency, and ease of implementation. Challenges involve choosing the appropriate number of clusters K, sensitivity to initialization, and limitations in capturing non-spherical cluster structures.
- How do mixture models differ from K-means in clustering, and what is their main modeling approach?
Mixture models assume that data is generated from a mixture of underlying probability distributions (e.g., Gaussian mixtures), leading to soft (probabilistic) cluster assignments rather than hard assignments as in K-means. They provide a generative probabilistic model of the data.
- What are the key takeaways regarding kernel K-means and its use in clustering non-linearly separable data?
Kernel K-means uses a kernel function to compute distances in an implicit feature space, allowing the algorithm to capture non-linear relationships. It extends K-means by enabling clustering of data that is not well-separated by a linear boundary in the original space.
- In what scenario might K-means fail to accurately represent clusters even when the data has a clear structure?
K-means may fail when clusters have irregular shapes or when a cluster’s structure cannot be captured by a single centroid (e.g., a ring-shaped or elongated cluster), leading to poor clustering performance.
- What are some common methods for evaluating the quality of clustering produced by K-means?
Common evaluation methods include calculating the within-cluster sum of squares, the silhouette score, and using the Elbow method or Gap Statistic to assess the compactness and separation of clusters.
- Provide a concise summary of K-means clustering and its extensions covered in the lecture.
K-means clustering partitions data into K groups by iteratively updating cluster assignments and centroids to minimize within-cluster variance. Key challenges include choosing K and initialization. Extensions like K-means++ improve initialization, SON formulations allow automatic determination of K, and kernel K-means enables clustering in non-linear feature spaces by employing the kernel trick.