Section 7: Clustering (K-means) Flashcards

1
Q

Define clustering

A

Clustering refers to a broad collection of unsupervised learning methods for finding groups of units in the data.
Units belonging to the same cluster are more similar to each other than units belonging to different clusters.
There are many approaches to clustering; we will focus on K-means.

2
Q

What are the parameters of k-means

A

The k-means clustering algorithm depends on two parameters:
K – The number of clusters.
μk – The cluster centroids.

3
Q

What could a cluster represent?

A

A cluster could arise from anything: a colour, a species separation, a behavioural separation. With unsupervised learning we do not know what the cause is.
In the data sets we will be using we will often have the true grouping available.
In reality you do not know this.

4
Q

Explain how k-means works in practice

A

The number of clusters K is an input.
The k-means algorithm has two steps: allocation and update.
The algorithm is initialised with randomly chosen cluster centroids.
Allocation: assign each observation to the closest cluster centroid.
Update: update each centroid by computing the mean of the points assigned to its cluster.
The two steps are repeated until convergence; the output is a vector of cluster allocations.
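The allocation/update loop above can be sketched in a few lines of self-contained Python (function and variable names are illustrative, not from the notes):

```python
# Minimal from-scratch sketch of the two-step k-means loop.
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Initialise with randomly chosen cluster centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Allocation: assign each observation to the closest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members)
                                           for c in zip(*members)))
            else:  # keep the old centroid if a cluster empties out
                new_centroids.append(centroids[j])
        # Converged: centroids (and hence allocations) no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated groups of points this recovers the grouping regardless of which observations are drawn as starting centroids.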

5
Q

What does convergence of the k-means algorithm mean?

A

The two steps repeat until the algorithm converges and reaches a stable configuration.
Basically, the process is repeated until points are no longer moved between groups.

6
Q

Explain Euclidean distance and its relevance to k means

A

The Euclidean distance between a pair of points (xi, xh) is the straight-line distance between the points.
K-means thinks about the dissimilarity between points in terms of this distance.
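In symbols, for points with p variables the straight-line distance is:

```latex
d(x_i, x_h) = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{hj})^2}
```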

7
Q

Why do we use the squared Euclidean distance?

A

We are examining the relative distance between points.
The squared Euclidean distance is often used in optimisation problems because square roots are hard to deal with; squaring is monotone, so it does not change which centroid is closest.

8
Q

What is zi?

A

zi is a vector of length K with a one in the position of the cluster that observation i is allocated to and zeros everywhere else (a one-hot encoding). For example, with K = 3 and observation i allocated to cluster 2, zi = (0, 1, 0).

9
Q

What is the objective function of the k-means algorithm?

A

The objective function uses the squared Euclidean distance and corresponds to the total within-cluster sum of squares, W.
The objective is to find the clustering that minimises this sum of squares, i.e. the objective function.
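With the one-hot allocations zik and the centroids μk, the total within-cluster sum of squares can be written as:

```latex
W = \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \, \lVert x_i - \mu_k \rVert^2
```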

10
Q

How is k-means optimisation similar to ANOVA, and what is the difference?

A

The decomposition of variability into groups is similar to the ANOVA concept, except here we do not know the grouping in advance, whereas with ANOVA you do.

11
Q

What is the allocation step of the iterative procedure of k means

A

Find the allocation values zik(t) that minimise the objective function, holding the centroid values μk(t−1) fixed.

The value of k that gives the smallest squared distance dictates which entry of zi is set to 1.

12
Q

What is the updating step of the iterative procedure of k means

A

Find the centroid values μk(t) that minimise the objective function, holding the allocation values zik(t) fixed.
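Putting the two steps of iteration t in symbols:

```latex
\text{Allocation: } z_{ik}^{(t)} =
\begin{cases}
1 & \text{if } k = \arg\min_{l} \lVert x_i - \mu_l^{(t-1)} \rVert^2 \\
0 & \text{otherwise}
\end{cases}
\qquad
\text{Update: } \mu_k^{(t)} = \frac{\sum_{i=1}^{N} z_{ik}^{(t)} x_i}{\sum_{i=1}^{N} z_{ik}^{(t)}}
```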

13
Q

What is the centroid

A

The centroid is the mean of the observations in a given cluster.

14
Q

What is one assumption k-means heavily relies upon, and how is this rectified?

A

The algorithm is started from random values for the centroids.
Different starting points can lead to different sub-optimal solutions, so the results rely heavily on the initialisation.
It is better to run the algorithm several times with different starting points.
Keep the solution with the minimum total within-cluster sum of squares as the best overall.

15
Q

Explain internal and external validation

A

Internal validation – Measures of internal cluster consistency: can be useful to select the number of clusters
External validation – Comparing clustering to external reference clustering: useful to assess the quality of a given clustering

16
Q

What two properties should a good choice of K in the k-means algorithm produce?

A

Data points belonging to the same cluster should be similar to each other and close to the centroid; points belonging to different clusters should be dissimilar.

17
Q

Name 4 methods of internal validation to quantify the characteristics and suitability of K

A

Elbow method - not very accurate
Calinski-Harabasz index.
Silhouette.
Gap statistic.

18
Q

What is Wk

A

The W(K) or total within sum of squares, measures the variability “within”, that is the variability between the data points xi assigned to cluster k and the corresponding centroid μk.

19
Q

What is Bk

A

The B(K), or the between sum of squares, measures the variability “between”, that is the variability between the clusters accounting for the centre of the data.

20
Q

What is total sum of squares

A

The total sum of squares is W(K) + B(K), where W(K) sums the within terms Wk over all clusters and B(K) sums the between terms Bk. This total is constant in K: it is the overall variability of the data around the grand mean.
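The decomposition underlying this, with grand mean x̄, cluster sizes nk, and clusters Ck, is:

```latex
\underbrace{\sum_{i=1}^{N} \lVert x_i - \bar{x} \rVert^2}_{\text{TSS}}
= \underbrace{\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2}_{W(K)}
+ \underbrace{\sum_{k=1}^{K} n_k \lVert \mu_k - \bar{x} \rVert^2}_{B(K)}
```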

21
Q

What is the elbow method and how does it identify the optimal K?

A

The elbow method identifies the optimal number of clusters K by plotting W(K) as a function of a range of values of the number of clusters.
The optimal K is wherever the elbow (the kink where the decrease in W(K) slows down) appears.

22
Q

What are the issues with elbow method - NOT ONE TO USE

A

Does not work for K=1.
Not easy to identify the elbow at times - could make arguments for multiple answers.
No sense of uncertainty or of multiple plausible appropriate values of K.

23
Q

How does the k-means algorithm treat W(K) and B(K)?

A

The algorithm returns the clustering minimising W(K), which maximises B(K).
We want a balance between the two.

24
Q

Explain the CH index

A

A large value of this index for a given K indicates a clustering solution with low within-cluster variability and large between-cluster variability; we want a balance between within and between variation.
CH is based on the distance between the points in a cluster and their centroid, but not on the distances between the data points themselves.

25
Q

What are the downsides of the CH index?

A

It still doesn't work for K=1, since the denominator K−1 is zero.

26
Q

Define the CH index: what the letters stand for, and the formula.

A

Calinski-Harabasz index

CH(K) = (B(K)/(K−1)) / (W(K)/(N−K)), where N is the number of observations.

27
Q

How to choose optimal k with CH index method

A

Run k-means for pre-specified range of values of K.
Compute the CH index for each value of K.
Plot the CH values versus K.
The “appropriate” K corresponds to the largest value of CH.
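A minimal stdlib-only sketch of computing CH(K) = (B(K)/(K−1))/(W(K)/(N−K)) from points and labels (function name illustrative):

```python
# Compute the Calinski-Harabasz index for a given clustering.
import math
from collections import defaultdict

def ch_index(points, labels, k):
    n = len(points)
    dim = len(points[0])
    # Grand mean (centre of all the data).
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    clusters = defaultdict(list)
    for p, label in zip(points, labels):
        clusters[label].append(p)
    w = b = 0.0
    for members in clusters.values():
        cent = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # Within: squared distances of points to their own centroid.
        w += sum(math.dist(p, cent) ** 2 for p in members)
        # Between: cluster size times squared distance of centroid to grand mean.
        b += len(members) * math.dist(cent, grand) ** 2
    return (b / (k - 1)) / (w / (n - k))
```

A well-separated labelling yields a much larger CH value than a scrambled one, matching the "low within, large between" reading above.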

28
Q

What is a silhouette

A

The silhouette is a measure of how close each point in one cluster is to points in the neighboring clusters and si(K) takes values in the range [−1, 1].
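In the usual notation, with ai the mean distance from observation i to the other points in its own cluster and bi the mean distance to the points of the nearest neighbouring cluster, the silhouette coefficient is:

```latex
s_i(K) = \frac{b_i - a_i}{\max(a_i, b_i)} \in [-1, 1]
```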

29
Q

What do the different values of a silhouette mean

A

A silhouette coefficient near 1 means the observation is far away from the neighbouring clusters.
A value of 0 indicates that the observation is on or very close to the decision boundary between two neighbouring clusters.
Negative values mean the observation may have been assigned to the wrong cluster.

30
Q

How is the average silhouette calculated

A

As a weighted average across all clusters, weighted by the number of observations in each cluster: the average silhouette of each cluster times the number of observations in that cluster, summed over all clusters and divided by the total number of observations.
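Writing s̄k for cluster k's average silhouette and nk for its size, this weighted average equals the plain mean over all N observations:

```latex
\bar{s}(K) = \frac{\sum_{k=1}^{K} n_k \, \bar{s}_k}{N} = \frac{1}{N} \sum_{i=1}^{N} s_i(K)
```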

31
Q

What does average silhouette value give insight into for whole data set

A

s(K) gives information about the cohesion of the overall clustering for the entire model and measures how appropriately the data has been clustered.
Large value indicates that K could be appropriate

32
Q

What do tails on silhouette plots mean?

A

A tail on the left-hand side of a cluster's section (negative silhouette values) indicates that observations may be in the wrong cluster.

33
Q

How is the optimal K selected from silhouettes, and what are the method's drawbacks?

A

Select the K with the largest average silhouette value. There is not necessarily one right answer; you might be happy with another value depending on the application.

A drawback of the silhouette method is that, once again, you cannot compare to K=1.
Another disadvantage is that there is no quantification of the uncertainty or variance at all.

34
Q

Explain the premise of the gap statistic

A

We compare the clustering in our data of interest to the clustering of data that has no cluster structure.
The gap statistic compares the observed within-cluster variation W(K) to W∗u(K), the within-cluster variation we would have observed if we clustered points distributed uniformly over the data space.

35
Q

What traits does the gap statistic have, particularly for a good K?

A

Always positive.
A significantly large gap indicates that the W(K) obtained by clustering the data points into K clusters is lower than the within-cluster variation we would have obtained by clustering uniform data.
This is evidence in favour of a clustering partition into K clusters.

36
Q

How is the gap statistic computed?

A

Generate B synthetic data sets with values sampled uniformly over the range of the actual data.
For each synthetic data set, run k-means with K clusters and compute the log of the within-cluster sum of squares.
Average these log values over all B simulations to approximate the expected value under no clustering.
Compute the gap statistic as the difference between this expectation and the log within-cluster sum of squares of the observed data.
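The steps above can be written as, with W∗b(K) the within-cluster sum of squares of the b-th synthetic data set:

```latex
\mathrm{Gap}(K) = \frac{1}{B} \sum_{b=1}^{B} \log W^{*}_{b}(K) \; - \; \log W(K)
```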

37
Q

What function in R carries out gap statistic calculation

A

clusGap (in the cluster package)

38
Q

How can the synthetic dataset be generated to calculate a gap statistic?

A

We can generate the synthetic dataset in the original space (uniform data over the range of the observed data)
or via PCA (uniform data over a box aligned with the principal components of the data).
The PCA method takes into account the shape of the data distribution.
It is better for variables which are highly correlated, as PCA implicitly removes that correlation.

39
Q

According to the gap statistic what is the optimal number of clusters K

A

The optimal number of clusters is the smallest K whose gap is not smaller than the first local maximum of the gap minus its standard error.

40
Q

In practice, what are the steps to find the optimal K by the gap statistic?

A

Find the first local maximum of the gap statistic.
Compute the differences against it.
The optimal number of clusters K is the smallest K whose gap is above the local maximum gap minus the SE of that gap.
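With Kmax the location of the first local maximum of the gap and sKmax its standard error, this selection rule reads:

```latex
\hat{K} = \min \left\{\, K : \mathrm{Gap}(K) \geq \mathrm{Gap}(K_{\max}) - s_{K_{\max}} \,\right\}
```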

41
Q

What are the advantages of the gap statistic method

A

An advantage of the gap statistic is that you can compute it for K=1, so you can see whether there is evidence of any clustering at all.
Additionally, we get a better idea of the uncertainty of the estimate via standard errors etc.
The gap statistic is very much a statistical way to select the number of clusters.

42
Q

Explain the concept of external validation

A

What do the clusters we have found to be optimal mean?
We can compare clustering partitions.
Let C denote the obtained clustering partition and C∗ a reference classification/partition of the units.
By comparing C to C∗, we use external information to measure the extent to which the obtained clustering partition agrees with the supplied class labels.

43
Q

What metric can we use for external validation

A

The Rand index or the adjusted Rand index.

44
Q

Why do we never use accuracy for external validation?

A

The labels of clusters are arbitrary, so accuracy is not appropriate. For example, the partitions (1,1,2,2) and (2,2,1,1) are identical clusterings but have 0% label accuracy.

45
Q

What is the rand index

A

It measures agreement by looking at how many pairs of observations are placed in the same cluster in both partitions and how many are placed in different clusters in both.
This count of agreeing pairs is then normalised by the total number of possible pairs.
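A direct pair-counting sketch of that description (function name illustrative, stdlib only):

```python
# Rand index: fraction of pairs of observations on which two
# partitions agree (same cluster in both, or different in both).
from itertools import combinations

def rand_index(labels_a, labels_b):
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)
```

Note that a pure relabelling, e.g. comparing [0, 0, 1, 1] with [1, 1, 0, 0], scores 1.0; this is exactly why pair agreement rather than label accuracy is used.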

46
Q

What flaw is in the rand index

A

The Rand index tends to give quite large values - overestimates agreement. In fact, even if we randomly assign the units to the clusters, we get large Rand index values.

47
Q

What is the adjusted rand index

A

Adjusted Rand index adjusts for agreement due to chance.
Index can be negative (but close to zero), indicating very low agreement.
The maximum value is 1, indicating perfect agreement.

48
Q

Why is selection of number of clusters not easy

A

It relies on a lot of information. Always visualise your data first and then consider the actual application before using a method to find the optimal number of clusters.

49
Q

What is logW in the gap statistic output?

A

The log of the within-cluster sum of squares from the k-means algorithm performed on the original data.

50
Q

What is E.logW in the gap statistic output?

A

An estimate of the log of the within-cluster sum of squares of the k-means algorithm applied to the synthetic uniformly distributed data. It is an estimate from a bootstrap of B replications.

51
Q

What is SE.sim in the gap statistic output?

A

The standard error of E.logW.

52
Q

Can I select the optimal number of clusters using Rand indices?

A

NO.
You must select the optimal number of clusters by internal validation metrics only, without reference to any reference clusters. Unsupervised learning means we cannot use a target variable to guide the learning process. Measures of external validation are used to assess the quality of a GIVEN clustering in comparison to a reference grouping of the observations.