Clustering Flashcards
DS
Briefly explain how the K-Means Clustering algorithm works.
K-means is an unsupervised clustering algorithm that partitions observations into k clusters.
The cluster means are usually initialized at random, often by choosing random observations from the data, and then shifted/updated over subsequent iterations.
At each iteration, every observation is assigned to the cluster whose mean it is nearest to, and then each cluster mean is recalculated from the observations assigned to it. This repeats until the cluster assignments stop changing.
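As a rough sketch of a single assign-and-update iteration, assuming a toy 2-D NumPy feature array X (all names here are illustrative):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))  # toy 2-D feature matrix (an assumption)
k = 3
means = X[rng.choice(len(X), size=k, replace=False)]  # initialize from random observations

# one iteration: assign every point to its nearest mean, then update the means
dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)  # shape (100, 3)
labels = dists.argmin(axis=1)
means = np.array([X[labels == j].mean(axis=0) for j in range(k)])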
What is one common use case for K-Means clustering?
Customer segmentation is probably the most common use case for k-means clustering (although it has many uses in various industries).
Often, unsupervised clustering is used to identify groups of similar customers (or data points), and then a separate predictive model is trained on each cluster. New customers are then first assigned to a cluster and scored using the appropriate model.
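A hedged sketch of that cluster-then-model pattern; the toy data, the choice of four segments, and the logistic regression per segment are all illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))                         # toy customer features (assumption)
y = (X[:, 0] + rng.normal(size=400) > 0).astype(int)  # toy binary target (assumption)

# segment customers, then train one predictive model per segment
segmenter = KMeans(n_clusters=4, random_state=1).fit(X)
models = {}
for seg in range(4):
    mask = segmenter.labels_ == seg
    models[seg] = LogisticRegression().fit(X[mask], y[mask])

# score a new customer: assign a segment first, then use that segment's model
x_new = rng.normal(size=(1, 5))
seg = segmenter.predict(x_new)[0]
score = models[seg].predict_proba(x_new)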
Why is it difficult to identify the “ideal” number of clusters in the dataset using K-Means Clustering?
There is no “ideal” number of clusters, since increasing the number of clusters always captures more information about the features (the limiting case is k = number of observations, i.e., each observation is its own cluster).
Having said that, there are numerous heuristics that attempt to identify the “optimal” number of clusters by recognizing when increasing the number of clusters only marginally increases information captured.
The true answer is usually driven by the application, though. If a business has the ability to create 4 different offers, then they may want to create 4 customer clusters, regardless of the data.
What is one heuristic to select “k” for K-Means Clustering?
One such method is the elbow method. In short, it attempts to identify the point at which adding additional clusters only marginally increases the variance explained by the clusters.
The elbow is the point at which we begin to see diminishing returns in explained variance when increasing k.
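A minimal sketch of the elbow method using sklearn's inertia_ attribute (the total within-cluster sum of squares); the toy data is an assumption:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))  # toy data (assumption)

# total within-cluster sum of squares (sklearn calls this inertia) for each candidate k
inertias = [KMeans(n_clusters=k, random_state=1).fit(X).inertia_ for k in range(1, 11)]

plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("within-cluster sum of squares")
plt.show()  # the 'elbow' is where the curve flattens out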
Does K-Means Clustering always converge to the same set of clusters? How does this affect the use of K-Means Clustering in production models?
No, there is no guarantee that k-means clustering converges to the same set of clusters, even given the same samples from the same population.
The clusters that are produced may be radically different depending on the initial cluster means selected.
For this reason, it is important that the cluster definitions remain static when using k-means clustering in production, to ensure that different clusters aren't created each time the model is retrained.
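One common way to keep cluster definitions static is to fit once with a fixed seed, persist the fitted model, and only call predict in production. A sketch (joblib is one persistence option among several; the data is a toy assumption):

import joblib
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 3))  # toy training data (assumption)

# fit once with a fixed seed so retraining is reproducible, then freeze the model
model = KMeans(n_clusters=4, random_state=1).fit(X)
joblib.dump(model, "kmeans_segments.joblib")

# in production: load the frozen model and assign new records to the EXISTING clusters
model = joblib.load("kmeans_segments.joblib")
X_new = np.random.default_rng(2).normal(size=(5, 3))  # new records (assumption)
clusters = model.predict(X_new)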
Explain how to evaluate the performance of a clustering model
The short answer is you probably can't, at least not in the way you would evaluate a supervised learning model.
While we cannot evaluate predictions vs. true values, as we don’t have a target vector, we can evaluate the nature of the clusters themselves. Intuitively, we can imagine “good” as having small distances between observations in the same cluster (i.e. dense clusters) and large distances between the different clusters (i.e. well separated clusters).
Silhouette coefficients provide a single value measuring both traits. Formally, the ith observation's silhouette coefficient is:
s_i = (b_i - a_i) / max(a_i, b_i)
where s_i is the silhouette coefficient for observation i, a_i is the mean distance between i and all observations in the same cluster, and b_i is the mean distance between i and all observations in the nearest other cluster. For example, if a_i = 1.0 and b_i = 3.0, then s_i = (3.0 - 1.0) / 3.0 ≈ 0.67. The silhouette score in sklearn is the mean silhouette coefficient over all observations; values range between [-1, 1], with 1 indicating dense, well-separated clusters.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# cluster by k-means (X is the feature matrix)
model = KMeans(n_clusters=2, random_state=1).fit(X)
# get predicted cluster labels
pred = model.labels_
# evaluate the model
sil_score = silhouette_score(X, pred)  # e.g. 0.89162
Provide the steps of the k-means clustering algorithm
Say we have one-dimensional data with three apparent groups:
x x x x  o o o o  z z z z
- Select the number of clusters you want to identify in the data; this is the “k”. Say we select k=3.
- RANDOMLY select 3 DISTINCT data points as the INITIAL CLUSTER MEANS.
- DO until the cluster assignments DO NOT CHANGE:
  - For each data point, measure the Euclidean distance sqrt((x1 - x2)^2 + (y1 - y2)^2) between the point and EACH cluster mean
  - Assign each point to the NEAREST CLUSTER
  - Recalculate the MEAN of EACH CLUSTER (a from-scratch sketch of these steps follows below)
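Here is a minimal from-scratch sketch of those steps in NumPy (a toy illustration, not sklearn's implementation; the empty-cluster guard is an added assumption):

import numpy as np

def kmeans(X, k=3, seed=1):
    rng = np.random.default_rng(seed)
    # randomly select k DISTINCT data points as the initial cluster means
    means = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Euclidean distance from every point to each cluster mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        # assign each point to the nearest cluster
        labels = dists.argmin(axis=1)
        # recalculate the mean of each cluster (keep the old mean if a cluster empties)
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        # stop when the clusters no longer change
        if np.allclose(new_means, means):
            return labels, means
        means = new_means

labels, centers = kmeans(np.random.default_rng(0).normal(size=(60, 2)))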
The evaluation metric is to COMPARE the VARIATION within each cluster:
By plotting k = 1, 2, … with the number of clusters k on the x-axis and the REDUCTION in total within-cluster VARIATION on the y-axis, we can visualize at which k the LARGEST REDUCTION in CLUSTER VARIATION occurs.
The above is called an ELBOW PLOT: it shows the k up to which the LARGEST REDUCTIONS in CLUSTER VARIATION occur; after that point, larger k values do NOT meaningfully REDUCE the VARIATION.