Section 7 Clustering K-means Flashcards
Define clustering
refers to a broad collection of unsupervised learning methods for finding groups of units in the data
Units belonging to the same cluster are more similar to each other than units belonging to different clusters.
Lots of approaches to clustering - we will focus on K-means
What are the parameters of k-means
The k-means clustering algorithm depends on two parameters:
K – The number of clusters.
μk – Cluster centroids
What could a cluster represent?
A cluster could be the result of anything: a colour, a species separation, a behavioural separation. With unsupervised learning we don't know what this cause is.
In the data we will be using, we will often have the true output (group labels) available.
In reality you don't know this.
Explain how k means works in practice
Number of clusters K is an input
The K-means algorithm has two steps: allocation and update.
Algorithm is initialized with randomly chosen cluster centroids.
Allocation: assign each observation to closest cluster centroid
Update: update centroids computing the mean of the points assigned to each cluster.
The two steps are repeated until convergence; the output is a vector of cluster allocations.
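The allocation and update steps above can be sketched in plain Python (a minimal illustration with made-up toy data, not the course's code):

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means sketch: points is a list of numeric tuples."""
    rng = random.Random(seed)
    # Initialise with k randomly chosen data points as centroids.
    centroids = rng.sample(points, k)
    alloc = []
    for _ in range(n_iter):
        # Allocation step: assign each observation to the closest centroid
        # (closest in squared Euclidean distance).
        alloc = [min(range(k),
                     key=lambda j: sum((a - b) ** 2
                                       for a, b in zip(p, centroids[j])))
                 for p in points]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, alloc) if a == j]
            new_centroids.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centroids[j])
        if new_centroids == centroids:   # converged: nothing moved
            break
        centroids = new_centroids
    return alloc, centroids

# Two obviously separated toy groups.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
alloc, cents = kmeans(pts, 2)
```

With these toy points the first three observations end up in one cluster and the last three in the other, whichever random start is used.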
What does convergence of k means algorithm mean
This repeats until the algorithm converges and reaches a stable configuration.
Basically, the process is repeated until points are no longer moved between groups and the centroids no longer change.
Explain Euclidean distance and its relevance to k means
The Euclidean distance between a pair of points (xi, xh) is the straight-line distance between them: d(xi, xh) = sqrt( Σj (xij − xhj)² )
K-means involves thinking about the dissimilarity between points in terms of distance
Why do we use the squared Euclidean distance
We're examining the relative distances between points.
The squared Euclidean distance is often used in optimisation problems because the square root is awkward to deal with, and dropping it does not change which point is closest.
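As a quick numeric illustration (toy values of my own choosing):

```python
import math

# Two hypothetical points in 2 dimensions.
x_i = (1.0, 2.0)
x_h = (4.0, 6.0)

# Squared Euclidean distance: sum of squared coordinate differences.
sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_h))  # (1-4)^2 + (2-6)^2 = 25
# The Euclidean distance itself is its square root.
dist = math.sqrt(sq_dist)  # 5.0
```

Since the square root is monotone, the point with the smallest squared distance is also the one with the smallest distance, so K-means can work with squared distances throughout.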
What is Z
zi is a vector of length K with a one in the position of the cluster observation i is allocated to, and zeros everywhere else (a one-hot indicator).
What is the objective function of the k means algorithm
The objective function uses the squared Euclidean distance, and it corresponds to the total within-cluster sum of squares, W = Σi Σk zik ‖xi − μk‖².
The objective is to find the clustering that minimises this sum of squares, i.e. the objective function.
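A sketch of W as code, assuming we already have allocations and centroids (the toy inputs are mine):

```python
def within_ss(points, alloc, centroids):
    """Total within-cluster sum of squares W: for each observation, the
    squared Euclidean distance to its allocated centroid, summed up."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
        for p, j in zip(points, alloc))

# Toy example: two clusters of two points each, centroids at their means.
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
W = within_ss(pts, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])  # each point contributes 1
```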
How is k means optimisation similar to ANOVA - whats the difference
This is similar to the ANOVA concept, except here we don't know the grouping in advance; with ANOVA you do.
What is the allocation step of the iterative procedure of k means
Find the allocation values zik(t) that minimise the objective function, holding the centroid values μk(t−1) fixed.
For each observation, the cluster k giving the smallest squared distance to its centroid is the one whose entry of zi is set to 1.
What is the updating step of the iterative procedure of k means
Find the centroid values μk(t) that minimise the objective function, holding the allocation values zik(t) fixed.
What is the centroid
Centroid is the mean of observations in a given cluster.
What is one assumption heavily relied upon for k means and how is this rectified
The algorithm is started from random values for the centroids.
Different starting points can lead to different sub-optimal solutions.
It is better to run the algorithm several times with different starting points, as the results depend heavily on the initialisation.
Keep the solution with the minimum total within-cluster sum of squares as the best overall.
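The multi-start strategy can be sketched as follows (self-contained, with a compact illustrative K-means inside; the function names and toy data are mine):

```python
import random

def kmeans_run(points, k, rng, n_iter=100):
    """One K-means run from a random start; returns (W, allocations)."""
    centroids = rng.sample(points, k)
    alloc = []
    for _ in range(n_iter):
        alloc = [min(range(k),
                     key=lambda j: sum((a - b) ** 2
                                       for a, b in zip(p, centroids[j])))
                 for p in points]
        new = []
        for j in range(k):
            members = [p for p, a in zip(points, alloc) if a == j]
            new.append(tuple(sum(c) / len(members) for c in zip(*members))
                       if members else centroids[j])
        if new == centroids:
            break
        centroids = new
    W = sum(sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
            for p, j in zip(points, alloc))
    return W, alloc

def kmeans_multistart(points, k, n_starts=10, seed=0):
    """Repeat K-means from several random starting points and keep the
    solution with the minimum total within-cluster sum of squares."""
    rng = random.Random(seed)
    return min(kmeans_run(points, k, rng) for _ in range(n_starts))

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best_W, best_alloc = kmeans_multistart(pts, 2)
```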
Explain internal and external validation
Internal validation – Measures of internal cluster consistency: can be useful to select the number of clusters
External validation – Comparing clustering to external reference clustering: useful to assess the quality of a given clustering
What two points should k value in k means algorithm highlight
Data points belonging to the same cluster should be similar to each other and close to the centroid, points belonging to different clusters should be dissimilar.
Name 4 methods of internal validation to quantify the characteristics and suitability of K
Elbow method - not very accurate
Calinski-Harabasz index.
Silhouette.
Gap statistic.
What is Wk
The W(K) or total within sum of squares, measures the variability “within”, that is the variability between the data points xi assigned to cluster k and the corresponding centroid μk.
What is Bk
The B(K), or the between sum of squares, measures the variability “between”, that is the variability between the clusters accounting for the centre of the data.
What is the total sum of squares
Summing Wk + Bk over all clusters gives the total sum of squares: TSS = W(K) + B(K), the overall variability of the data around the grand mean.
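The decomposition can be checked numerically on a toy clustering (the data and allocation are made up for illustration):

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
alloc = [0, 0, 1, 1]

clusters = [[p for p, a in zip(pts, alloc) if a == k] for k in range(2)]
centroids = [mean(c) for c in clusters]
grand = mean(pts)  # centre of the whole data set

# W(K): variability of points around their own cluster centroid.
W = sum(sq_dist(p, centroids[a]) for p, a in zip(pts, alloc))
# B(K): variability of the (size-weighted) centroids around the grand mean.
B = sum(len(c) * sq_dist(m, grand) for c, m in zip(clusters, centroids))
# TSS: variability of all points around the grand mean.
TSS = sum(sq_dist(p, grand) for p in pts)

assert abs(TSS - (W + B)) < 1e-9  # TSS = W(K) + B(K)
```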
What is the elbow method and how does it identify the optimal K
The elbow method identifies the optimal number of clusters K by plotting W(K) as a function of a range of values of the number of clusters.
The optimal K is wherever the "elbow" appears, i.e. where adding further clusters stops reducing W(K) substantially.
What are the issues with elbow method - NOT ONE TO USE
Does not work for K=1.
Not easy to identify the elbow at times - could make arguments for multiple answers.
No sense of uncertainty or of multiple plausible appropriate values of K.
How does the k means algorithm treat W(K) and B(K)
The algorithm returns the clustering minimising W(K) and, equivalently, maximising B(K).
We want a balance between the two.
Explain the CH index
A large value of this index for a given value of K is indication of a clustering solution with low within variability and large between cluster variability. Want a balance between within and between variation
CH is based on the distance between points in a cluster and their centroid, but not on distance between the data points.
What’s the downsides of the CH index
Still doesn’t work for K=1
Define CH index what letters stand for and the formula
Calinski-Harabasz index
CH(K) = ( B(K) / (K − 1) ) / ( W(K) / (N − K) )
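As a small sketch of the formula (the toy numbers are mine, reusing W = 4 and B = 100 from a four-point example):

```python
def calinski_harabasz(W, B, N, K):
    """CH index: between-cluster variability per degree of freedom (K-1)
    divided by within-cluster variability per degree of freedom (N-K)."""
    return (B / (K - 1)) / (W / (N - K))

# Toy values: N = 4 points in K = 2 clusters with W = 4 and B = 100.
ch = calinski_harabasz(W=4.0, B=100.0, N=4, K=2)  # (100/1) / (4/2) = 50.0
```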
How to choose optimal k with CH index method
Run k-means for pre-specified range of values of K.
Compute the CH index for each value of K.
Plot the CH values versus K.
The “appropriate” K corresponds to the largest value of CH.
What is a silhouette
The silhouette is a measure of how close each point in one cluster is to points in the neighboring clusters and si(K) takes values in the range [−1, 1].
What do the different values of a silhouette mean
A silhouette coefficient near 1 means the observation is far away from the neighbouring clusters.
A value of 0 indicates that the observation is on or very close to the decision boundary between two neighbouring clusters.
Negative values mean the observation may have been assigned to the wrong cluster.
How is the average silhouette calculated
As a weighted average across clusters, weighting each cluster's average silhouette by its number of observations: s(K) = ( Σk nk · s̄k ) / N, where s̄k is the average silhouette of cluster k and N is the total number of observations.
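A sketch of the silhouette computation in plain Python (assumes every cluster has at least two observations; the toy data are mine):

```python
import math

def silhouettes(points, alloc):
    """Per-observation silhouette values s_i using Euclidean distance."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    labels = sorted(set(alloc))
    scores = []
    for i, (p, li) in enumerate(zip(points, alloc)):
        # a_i: average distance to the other members of i's own cluster.
        own = [dist(p, q) for j, (q, lj) in enumerate(zip(points, alloc))
               if lj == li and j != i]
        a_i = sum(own) / len(own)
        # b_i: smallest average distance to any other cluster (the neighbour).
        b_i = min(sum(dist(p, q) for q, lj in zip(points, alloc) if lj == l)
                  / alloc.count(l)
                  for l in labels if l != li)
        scores.append((b_i - a_i) / max(a_i, b_i))
    return scores

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
s = silhouettes(pts, [0, 0, 1, 1])
avg_s = sum(s) / len(s)  # s(K): average silhouette over all observations
```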
What does average silhouette value give insight into for whole data set
s(K) gives information about the cohesion of the overall clustering for the entire model and measures how appropriately the data has been clustered.
Large value indicates that K could be appropriate
What does tails on silhouette plots mean
A tail on the left-hand side of a cluster's section of the plot (values dropping towards or below zero) indicates observations that may be in the wrong cluster.
How is optimal K selected from silhouettes, what are the method drawbacks
Largest average silhouette value. Not necessarily one right answer - might be happy with other depending on application.
A drawback of the silhouette method is that, once again, you cannot compare to K=1.
Another disadvantage is that there is no quantification of the uncertainty or variance at all.
Explain the premise of the gap statistic
We are comparing the clustering in our data of interest to the clustering of data with no cluster structure.
The gap statistic compares the observed within-cluster variation W(K) to W∗u(K), the within-cluster variation that we would have observed if we clustered points distributed uniformly over the data space
What traits does the gap statistic have, particular for a good K
Always positive
A significantly large gap indicates that the W(K) obtained by clustering the data points into K clusters is lower than the within cluster variation that we would have obtained by clustering uniformly
This is evidence of a clustering partition into K clusters.
How is gap statistic computed
Generate B synthetic data sets with values sampled uniformly over the range of the actual data.
For each data set, run K-means with K clusters and compute the log of the within-cluster sum of squares.
Compute the average of these log values over all simulations to approximate the expected value.
Compute gap statistic
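These steps can be sketched end-to-end as below (a self-contained illustration with a compact K-means inside; the function names are mine, and a real analysis would use R's clusGap):

```python
import math
import random

def kmeans_W(points, k, rng, n_starts=4, n_iter=50):
    """Best (smallest) within-cluster sum of squares over a few restarts."""
    best = None
    for _ in range(n_starts):
        centroids = rng.sample(points, k)
        alloc = []
        for _ in range(n_iter):
            alloc = [min(range(k),
                         key=lambda j: sum((a - b) ** 2
                                           for a, b in zip(p, centroids[j])))
                     for p in points]
            new = []
            for j in range(k):
                members = [p for p, a in zip(points, alloc) if a == j]
                new.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centroids[j])
            if new == centroids:
                break
            centroids = new
        W = sum(sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
                for p, j in zip(points, alloc))
        best = W if best is None else min(best, W)
    return best

def gap_statistic(points, k, n_ref=20, seed=0):
    """Gap(K) = mean_b log W*_b(K) - log W(K), with the B reference data sets
    sampled uniformly over the range of each variable ('original space')."""
    rng = random.Random(seed)
    log_W = math.log(kmeans_W(points, k, rng))
    dims = list(zip(*points))  # per-variable coordinates
    lo, hi = [min(d) for d in dims], [max(d) for d in dims]
    log_W_ref = []
    for _ in range(n_ref):
        ref = [tuple(rng.uniform(a, b) for a, b in zip(lo, hi))
               for _ in points]
        log_W_ref.append(math.log(kmeans_W(ref, k, rng)))
    return sum(log_W_ref) / n_ref - log_W

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11)]
gap = gap_statistic(pts, 2)  # clearly positive: these data really do cluster
```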
What function in R carries out gap statistic calculation
clusGap, from the cluster package.
How can the synthetic dataset be generated to calculate a gap statistic
We can generate the synthetic dataset using the original space (uniform data over the range of the observed data)
or PCA (uniform data over a box aligned with the principal components of the data).
PCA Method takes into account the shape of the data distribution
It's better for variables which are highly correlated, as PCA implicitly takes out that correlation.
According to the gap statistic what is the optimal number of clusters K
The optimal number of clusters K is the smallest K whose gap statistic is not smaller than the first local maximum of the gap minus its standard error:
In practice steps to find optimal K by gap statistic are?
Find the first local maximum of the gap statistic.
Subtract its standard error from it to form a threshold.
The optimal number of clusters K is the smallest K whose gap is not below that threshold (local maximum gap − its SE).
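A sketch of this selection rule, given gap values and standard errors already computed for K = 1, 2, ... (the illustrative numbers are mine):

```python
def choose_k(gaps, ses):
    """Smallest K whose gap reaches (first local maximum gap - its SE).
    gaps[i] and ses[i] correspond to K = i + 1."""
    # Step 1: locate the first local maximum of the gap curve.
    m = 0
    for i in range(len(gaps)):
        left_ok = (i == 0) or (gaps[i] >= gaps[i - 1])
        right_ok = (i == len(gaps) - 1) or (gaps[i] >= gaps[i + 1])
        if left_ok and right_ok:
            m = i
            break
    # Step 2: threshold = gap at that maximum minus its standard error.
    threshold = gaps[m] - ses[m]
    # Step 3: smallest K (up to the maximum) whose gap is not below it.
    for i in range(m + 1):
        if gaps[i] >= threshold:
            return i + 1

# Example: the gap rises to a maximum at K = 3 and falls afterwards,
# but K = 2 is already within one SE of that maximum.
k_best = choose_k([0.20, 0.86, 0.90, 0.85, 0.70], [0.05] * 5)  # -> 2
```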
What are the advantages of the gap statistic method
An advantage of the gap statistic is that you can compute it for K=1, so you can see whether there is evidence of any clustering at all.
Additionally, we get a better idea of the uncertainty of the estimate we make, via the standard errors.
Gap statistic is very much a statistical way to select the number of clusters
Explain the concept of external validation
What do the clusters we have found to be optimal actually mean?
We can compare clustering partitions
Let C denote the obtained clustering partition, and C∗ denote a reference classification/partition of the units.
By comparing C to C∗, we use external information to measure the extent to which the obtained clustering partition agrees with supplied class labels.
What metric can we use for external validation
Rand index or adjusted rand index
Why do we never use accuracy for external validation?
The labels of clusters are arbitrary (cluster 1 in our partition need not correspond to class 1 in the reference), so accuracy is not appropriate.
What is the rand index
It measures agreement by counting the pairs of observations placed in the same cluster in both partitions and the pairs placed in different clusters in both.
This count is then normalised by the total number of possible pairs of observations.
What flaw is in the rand index
The Rand index tends to give quite large values - overestimates agreement. In fact, even if we randomly assign the units to the clusters, we get large Rand index values.
What is the adjusted rand index
Adjusted Rand index adjusts for agreement due to chance.
Index can be negative (but close to zero), indicating very low agreement.
The maximum value is 1, indicating perfect agreement.
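Both indices can be computed by pair counting (a sketch; the toy partitions below are mine):

```python
from collections import Counter
from math import comb

def rand_indices(part_a, part_b):
    """Return (Rand index, adjusted Rand index) between two partitions,
    given as equal-length lists of cluster labels."""
    n = len(part_a)
    pairs = comb(n, 2)
    # Pairs placed together within each partition, and within both at once.
    same_both = sum(comb(c, 2) for c in Counter(zip(part_a, part_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(part_a).values())
    same_b = sum(comb(c, 2) for c in Counter(part_b).values())
    # Rand: fraction of pairs treated the same way by both partitions
    # (together in both, or apart in both).
    rand = (pairs + 2 * same_both - same_a - same_b) / pairs
    # Adjusted Rand: subtract the agreement expected under random labelling.
    expected = same_a * same_b / pairs
    max_index = (same_a + same_b) / 2
    ari = (same_both - expected) / (max_index - expected)
    return rand, ari

# The same grouping under permuted labels still scores perfect agreement,
# which is exactly why raw accuracy is the wrong metric here.
r, a = rand_indices([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0])
```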
Why is selection of number of clusters not easy
It relies on a lot of information. Always visualise your data first and then consider the actual application before using a method to find the optimal number of clusters.
Whats log W in gap statistic output
Log of the within-cluster sum of squares of k means algorithm performed on the original data
Whats E.log W in gap statistic output
Estimate of the log of the within-cluster sum of squares of the k means algorithm applied to the synthetic uniformly distributed data. It's an estimate averaged over the B simulated replications.
Whats SE.sim in the gap statistic output
Standard error of E.log.W
Can i select optimal clusters using rand indexes
NO
Must select the optimal amount of clusters by internal validation metrics only without reference to any reference clusters. Unsupervised learning means cannot use a target variable to guide the learning process. Measures of external validation are used to assess the quality of a GIVEN clustering in comparison to a reference grouping of the observations.