Section 7 Clustering K-means Flashcards
Define clustering
refers to a broad collection of unsupervised learning methods for finding groups of units in the data
Units belonging to the same cluster are more similar to each other than units belonging to different clusters.
Lots of approaches to clustering - we will focus on K-means
What are the parameters of k-means
The k-means clustering algorithm depends on two parameters:
K – The number of clusters.
μk – Cluster centroids
What could a cluster represent?
A cluster could be the result of anything: a colour, a species separation, a behavioural separation. With unsupervised learning we don't know what this cause is.
In the data we will be using, we will often have the true output (group labels) available.
In reality you don't know this.
Explain how k means works in practice
Number of clusters K is an input
The K-means algorithm has two steps: allocation and update.
Algorithm is initialized with randomly chosen cluster centroids.
Allocation: assign each observation to closest cluster centroid
Update: update centroids computing the mean of the points assigned to each cluster.
The two steps are repeated until convergence; the output is a vector of cluster allocations.
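The allocation and update steps above can be sketched in plain Python (a minimal illustration with made-up toy data, not the course's code):

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means sketch: points is a list of numeric tuples."""
    rng = random.Random(seed)
    # Initialise with k randomly chosen data points as centroids.
    centroids = rng.sample(points, k)
    alloc = []
    for _ in range(n_iter):
        # Allocation step: assign each observation to the closest centroid
        # (closest in squared Euclidean distance).
        alloc = [min(range(k),
                     key=lambda j: sum((a - b) ** 2
                                       for a, b in zip(p, centroids[j])))
                 for p in points]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, alloc) if a == j]
            new_centroids.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centroids[j])
        if new_centroids == centroids:   # converged: nothing moved
            break
        centroids = new_centroids
    return alloc, centroids

# Two obviously separated toy groups.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
alloc, cents = kmeans(pts, 2)
```

With these toy points the first three observations end up in one cluster and the last three in the other, whichever random start is used.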
What does convergence of k means algorithm mean
This repeats until the algorithm converges and reaches a stable configuration.
Basically, the process is repeated until points are no longer moved between groups and the centroids no longer change.
Explain Euclidean distance and its relevance to k means
The Euclidean distance between a pair of points (xi, xh) is the straight-line distance between them: d(xi, xh) = sqrt( Σj (xij − xhj)² )
K-means involves thinking about the dissimilarity between points in terms of distance
Why do we use the squared Euclidean distance
We're examining the relative distances between points.
The squared Euclidean distance is often used in optimisation problems because the square root is awkward to deal with, and dropping it does not change which point is closest.
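As a quick numeric illustration (toy values of my own choosing):

```python
import math

# Two hypothetical points in 2 dimensions.
x_i = (1.0, 2.0)
x_h = (4.0, 6.0)

# Squared Euclidean distance: sum of squared coordinate differences.
sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_h))  # (1-4)^2 + (2-6)^2 = 25
# The Euclidean distance itself is its square root.
dist = math.sqrt(sq_dist)  # 5.0
```

Since the square root is monotone, the point with the smallest squared distance is also the one with the smallest distance, so K-means can work with squared distances throughout.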
What is Z
zi is a vector of length K with a one in the position of the cluster observation i is allocated to, and zeros everywhere else (a one-hot indicator).
What is the objective function of the k means algorithm
The objective function uses the squared Euclidean distance, and it corresponds to the total within-cluster sum of squares, W = Σi Σk zik ‖xi − μk‖².
The objective is to find the clustering that minimises this sum of squares, i.e. the objective function.
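A sketch of W as code, assuming we already have allocations and centroids (the toy inputs are mine):

```python
def within_ss(points, alloc, centroids):
    """Total within-cluster sum of squares W: for each observation, the
    squared Euclidean distance to its allocated centroid, summed up."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
        for p, j in zip(points, alloc))

# Toy example: two clusters of two points each, centroids at their means.
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
W = within_ss(pts, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])  # each point contributes 1
```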
How is k means optimisation similar to ANOVA - whats the difference
This is similar to the ANOVA concept, except here we don't know the grouping in advance; with ANOVA you do.
What is the allocation step of the iterative procedure of k means
Find the allocation values zik(t) that minimise the objective function, holding the centroid values μk(t−1) fixed.
For each observation, the cluster k giving the smallest squared distance to its centroid is the one whose entry of zi is set to 1.
What is the updating step of the iterative procedure of k means
Find the centroid values μk(t) that minimise the objective function, holding the allocation values zik(t) fixed.
What is the centroid
Centroid is the mean of observations in a given cluster.
What is one assumption heavily relied upon for k means and how is this rectified
The algorithm is started from random values for the centroids.
Different starting points can lead to different sub-optimal solutions.
It is better to run the algorithm several times with different starting points, as the results depend heavily on the initialisation.
Keep the solution with the minimum total within-cluster sum of squares as the best overall.
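The multi-start strategy can be sketched as follows (self-contained, with a compact illustrative K-means inside; the function names and toy data are mine):

```python
import random

def kmeans_run(points, k, rng, n_iter=100):
    """One K-means run from a random start; returns (W, allocations)."""
    centroids = rng.sample(points, k)
    alloc = []
    for _ in range(n_iter):
        alloc = [min(range(k),
                     key=lambda j: sum((a - b) ** 2
                                       for a, b in zip(p, centroids[j])))
                 for p in points]
        new = []
        for j in range(k):
            members = [p for p, a in zip(points, alloc) if a == j]
            new.append(tuple(sum(c) / len(members) for c in zip(*members))
                       if members else centroids[j])
        if new == centroids:
            break
        centroids = new
    W = sum(sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
            for p, j in zip(points, alloc))
    return W, alloc

def kmeans_multistart(points, k, n_starts=10, seed=0):
    """Repeat K-means from several random starting points and keep the
    solution with the minimum total within-cluster sum of squares."""
    rng = random.Random(seed)
    return min(kmeans_run(points, k, rng) for _ in range(n_starts))

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best_W, best_alloc = kmeans_multistart(pts, 2)
```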
Explain internal and external validation
Internal validation – Measures of internal cluster consistency: can be useful to select the number of clusters
External validation – Comparing clustering to external reference clustering: useful to assess the quality of a given clustering
What two points should k value in k means algorithm highlight
Data points belonging to the same cluster should be similar to each other and close to the centroid, points belonging to different clusters should be dissimilar.
Name 4 methods of internal validation to quantify the characteristics and suitability of K
Elbow method - not very accurate
Calinski-Harabasz index.
Silhouette.
Gap statistic.
What is Wk
The W(K) or total within sum of squares, measures the variability “within”, that is the variability between the data points xi assigned to cluster k and the corresponding centroid μk.
What is Bk
The B(K), or the between sum of squares, measures the variability “between”, that is the variability between the clusters accounting for the centre of the data.
What is the total sum of squares
Summing Wk + Bk over all clusters gives the total sum of squares: TSS = W(K) + B(K), the overall variability of the data around the grand mean.
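The decomposition can be checked numerically on a toy clustering (the data and allocation are made up for illustration):

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
alloc = [0, 0, 1, 1]

clusters = [[p for p, a in zip(pts, alloc) if a == k] for k in range(2)]
centroids = [mean(c) for c in clusters]
grand = mean(pts)  # centre of the whole data set

# W(K): variability of points around their own cluster centroid.
W = sum(sq_dist(p, centroids[a]) for p, a in zip(pts, alloc))
# B(K): variability of the (size-weighted) centroids around the grand mean.
B = sum(len(c) * sq_dist(m, grand) for c, m in zip(clusters, centroids))
# TSS: variability of all points around the grand mean.
TSS = sum(sq_dist(p, grand) for p in pts)

assert abs(TSS - (W + B)) < 1e-9  # TSS = W(K) + B(K)
```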
What is the elbow method and how does it identify the optimal K
The elbow method identifies the optimal number of clusters K by plotting W(K) as a function of a range of values of the number of clusters.
The optimal K is wherever the "elbow" appears, i.e. where adding further clusters stops reducing W(K) substantially.
What are the issues with elbow method - NOT ONE TO USE
Does not work for K=1.
Not easy to identify the elbow at times - could make arguments for multiple answers.
No sense of uncertainty or of multiple plausible appropriate values of K.
How does the k means algorithm treat W(K) and B(K)
The algorithm returns the clustering minimising W(K) and, equivalently, maximising B(K).
We want a balance between the two.
Explain the CH index
A large value of this index for a given value of K is indication of a clustering solution with low within variability and large between cluster variability. Want a balance between within and between variation
CH is based on the distance between points in a cluster and their centroid, but not on distance between the data points.
What’s the downsides of the CH index
Still doesn’t work for K=1
Define CH index what letters stand for and the formula
Calinski-Harabasz index
CH(K) = ( B(K) / (K − 1) ) / ( W(K) / (N − K) )
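As a small sketch of the formula (the toy numbers are mine, reusing W = 4 and B = 100 from a four-point example):

```python
def calinski_harabasz(W, B, N, K):
    """CH index: between-cluster variability per degree of freedom (K-1)
    divided by within-cluster variability per degree of freedom (N-K)."""
    return (B / (K - 1)) / (W / (N - K))

# Toy values: N = 4 points in K = 2 clusters with W = 4 and B = 100.
ch = calinski_harabasz(W=4.0, B=100.0, N=4, K=2)  # (100/1) / (4/2) = 50.0
```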
How to choose optimal k with CH index method
Run k-means for pre-specified range of values of K.
Compute the CH index for each value of K.
Plot the CH values versus K.
The “appropriate” K corresponds to the largest value of CH.
What is a silhouette
The silhouette is a measure of how close each point in one cluster is to points in the neighboring clusters and si(K) takes values in the range [−1, 1].
What do the different values of a silhouette mean
A silhouette coefficient near 1 means the observation is far away from the neighbouring clusters.
A value of 0 indicates that the observation is on or very close to the decision boundary between two neighbouring clusters.
Negative values mean the observation may have been assigned to the wrong cluster.
How is the average silhouette calculated
As a weighted average across clusters, weighting each cluster's average silhouette by its number of observations: s(K) = ( Σk nk · s̄k ) / N, where s̄k is the average silhouette of cluster k and N is the total number of observations.
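A sketch of the silhouette computation in plain Python (assumes every cluster has at least two observations; the toy data are mine):

```python
import math

def silhouettes(points, alloc):
    """Per-observation silhouette values s_i using Euclidean distance."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    labels = sorted(set(alloc))
    scores = []
    for i, (p, li) in enumerate(zip(points, alloc)):
        # a_i: average distance to the other members of i's own cluster.
        own = [dist(p, q) for j, (q, lj) in enumerate(zip(points, alloc))
               if lj == li and j != i]
        a_i = sum(own) / len(own)
        # b_i: smallest average distance to any other cluster (the neighbour).
        b_i = min(sum(dist(p, q) for q, lj in zip(points, alloc) if lj == l)
                  / alloc.count(l)
                  for l in labels if l != li)
        scores.append((b_i - a_i) / max(a_i, b_i))
    return scores

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
s = silhouettes(pts, [0, 0, 1, 1])
avg_s = sum(s) / len(s)  # s(K): average silhouette over all observations
```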
What does average silhouette value give insight into for whole data set
s(K) gives information about the cohesion of the overall clustering for the entire model and measures how appropriately the data has been clustered.
Large value indicates that K could be appropriate
What does tails on silhouette plots mean
A tail on the left-hand side of a cluster's section of the plot (values dropping towards or below zero) indicates observations that may be in the wrong cluster.
How is optimal K selected from silhouettes, what are the method drawbacks
Largest average silhouette value. Not necessarily one right answer - might be happy with other depending on application.
A drawback of the silhouette method is that, once again, you cannot compare to K=1.
Another disadvantage is that there is no quantification of the uncertainty or variance at all.
Explain the premise of the gap statistic
We are comparing the clustering in our data of interest to the clustering of data with no cluster structure.
The gap statistic compares the observed within-cluster variation W(K) to W∗u(K), the within-cluster variation that we would have observed if we clustered points distributed uniformly over the data space
What traits does the gap statistic have, particular for a good K
Always positive
A significantly large gap indicates that the W(K) obtained by clustering the data points into K clusters is lower than the within cluster variation that we would have obtained by clustering uniformly
This is evidence of a clustering partition into K clusters.
How is gap statistic computed
Generate B synthetic data sets with values sampled uniformly over the range of the actual data.
For each data set, run K-means with K clusters and compute the log of the within-cluster sum of squares.
Compute the average of these log values over all simulations to approximate the expected value.
Compute gap statistic
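These steps can be sketched end-to-end as below (a self-contained illustration with a compact K-means inside; the function names are mine, and a real analysis would use R's clusGap):

```python
import math
import random

def kmeans_W(points, k, rng, n_starts=4, n_iter=50):
    """Best (smallest) within-cluster sum of squares over a few restarts."""
    best = None
    for _ in range(n_starts):
        centroids = rng.sample(points, k)
        alloc = []
        for _ in range(n_iter):
            alloc = [min(range(k),
                         key=lambda j: sum((a - b) ** 2
                                           for a, b in zip(p, centroids[j])))
                     for p in points]
            new = []
            for j in range(k):
                members = [p for p, a in zip(points, alloc) if a == j]
                new.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centroids[j])
            if new == centroids:
                break
            centroids = new
        W = sum(sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
                for p, j in zip(points, alloc))
        best = W if best is None else min(best, W)
    return best

def gap_statistic(points, k, n_ref=20, seed=0):
    """Gap(K) = mean_b log W*_b(K) - log W(K), with the B reference data sets
    sampled uniformly over the range of each variable ('original space')."""
    rng = random.Random(seed)
    log_W = math.log(kmeans_W(points, k, rng))
    dims = list(zip(*points))  # per-variable coordinates
    lo, hi = [min(d) for d in dims], [max(d) for d in dims]
    log_W_ref = []
    for _ in range(n_ref):
        ref = [tuple(rng.uniform(a, b) for a, b in zip(lo, hi))
               for _ in points]
        log_W_ref.append(math.log(kmeans_W(ref, k, rng)))
    return sum(log_W_ref) / n_ref - log_W

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11)]
gap = gap_statistic(pts, 2)  # clearly positive: these data really do cluster
```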
What function in R carries out gap statistic calculation
clusGap, from the cluster package.
How can the synthetic dataset be generated to calculate a gap statistic
We can generate the synthetic dataset using the original space (uniform data over the range of the observed data)
or PCA (uniform data over a box aligned with the principal components of the data).
PCA Method takes into account the shape of the data distribution
It's better for variables which are highly correlated, as PCA implicitly takes out that correlation.
According to the gap statistic what is the optimal number of clusters K
The optimal number of clusters K is the smallest K whose gap statistic is not smaller than the first local maximum of the gap minus its standard error:
In practice steps to find optimal K by gap statistic are?
Find the first local maximum of the gap statistic.
Subtract its standard error from it to form a threshold.
The optimal number of clusters K is the smallest K whose gap is not below that threshold (local maximum gap − its SE).
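A sketch of this selection rule, given gap values and standard errors already computed for K = 1, 2, ... (the illustrative numbers are mine):

```python
def choose_k(gaps, ses):
    """Smallest K whose gap reaches (first local maximum gap - its SE).
    gaps[i] and ses[i] correspond to K = i + 1."""
    # Step 1: locate the first local maximum of the gap curve.
    m = 0
    for i in range(len(gaps)):
        left_ok = (i == 0) or (gaps[i] >= gaps[i - 1])
        right_ok = (i == len(gaps) - 1) or (gaps[i] >= gaps[i + 1])
        if left_ok and right_ok:
            m = i
            break
    # Step 2: threshold = gap at that maximum minus its standard error.
    threshold = gaps[m] - ses[m]
    # Step 3: smallest K (up to the maximum) whose gap is not below it.
    for i in range(m + 1):
        if gaps[i] >= threshold:
            return i + 1

# Example: the gap rises to a maximum at K = 3 and falls afterwards,
# but K = 2 is already within one SE of that maximum.
k_best = choose_k([0.20, 0.86, 0.90, 0.85, 0.70], [0.05] * 5)  # -> 2
```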
What are the advantages of the gap statistic method
An advantage of the gap statistic is that you can compute it for K=1, so you can see whether there is evidence of any clustering at all.
Additionally, we get a better idea of the uncertainty of the estimate we make, via the standard errors.
Gap statistic is very much a statistical way to select the number of clusters
Explain the concept of external validation
What do the clusters we have found to be optimal actually mean?
We can compare clustering partitions
Let C denote the obtained clustering partition, and C∗ denote a reference classification/partition of the units.
By comparing C to C∗, we use external information to measure the extent to which the obtained clustering partition agrees with supplied class labels.
What metric can we use for external validation
Rand index or adjusted rand index
Why do we never use accuracy for external validation?
The labels of clusters are arbitrary (cluster 1 in our partition need not correspond to class 1 in the reference), so accuracy is not appropriate.
What is the rand index
It measures agreement by counting the pairs of observations placed in the same cluster in both partitions and the pairs placed in different clusters in both.
This count is then normalised by the total number of possible pairs of observations.
What flaw is in the rand index
The Rand index tends to give quite large values - overestimates agreement. In fact, even if we randomly assign the units to the clusters, we get large Rand index values.
What is the adjusted rand index
Adjusted Rand index adjusts for agreement due to chance.
Index can be negative (but close to zero), indicating very low agreement.
The maximum value is 1, indicating perfect agreement.
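Both indices can be computed by pair counting (a sketch; the toy partitions below are mine):

```python
from collections import Counter
from math import comb

def rand_indices(part_a, part_b):
    """Return (Rand index, adjusted Rand index) between two partitions,
    given as equal-length lists of cluster labels."""
    n = len(part_a)
    pairs = comb(n, 2)
    # Pairs placed together within each partition, and within both at once.
    same_both = sum(comb(c, 2) for c in Counter(zip(part_a, part_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(part_a).values())
    same_b = sum(comb(c, 2) for c in Counter(part_b).values())
    # Rand: fraction of pairs treated the same way by both partitions
    # (together in both, or apart in both).
    rand = (pairs + 2 * same_both - same_a - same_b) / pairs
    # Adjusted Rand: subtract the agreement expected under random labelling.
    expected = same_a * same_b / pairs
    max_index = (same_a + same_b) / 2
    ari = (same_both - expected) / (max_index - expected)
    return rand, ari

# The same grouping under permuted labels still scores perfect agreement,
# which is exactly why raw accuracy is the wrong metric here.
r, a = rand_indices([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0])
```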
Why is selection of number of clusters not easy
It relies on a lot of information. Always visualise your data first and then consider the actual application before using a method to find the optimal number of clusters.
Whats log W in gap statistic output
Log of the within-cluster sum of squares of k means algorithm performed on the original data
Whats E.log W in gap statistic output
Estimate of the log of the within-cluster sum of squares of the k means algorithm applied to the synthetic uniformly distributed data. It's an estimate averaged over the B simulated replications.
Whats SE.sim in the gap statistic output
Standard error of E.log.W
Can i select optimal clusters using rand indexes
NO
Must select the optimal amount of clusters by internal validation metrics only without reference to any reference clusters. Unsupervised learning means cannot use a target variable to guide the learning process. Measures of external validation are used to assess the quality of a GIVEN clustering in comparison to a reference grouping of the observations.