Clustering Flashcards
Explain the steps taken in the K-means algorithm
1. Randomly generate k initial centroids.
2. Compute the distance from each sample to every centroid.
3. Assign each sample to its closest centroid.
4. Recompute each centroid as the mean of its assigned samples.
5. If the assignments have not converged, go back to step 2.
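The steps above can be sketched in a few lines of numpy (the function name and the choice of initialising centroids by sampling data points are illustrative, not from the notes):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X has shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k distinct samples as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: distance from each sample to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each sample to its closest centroid.
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```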
What are the pros and cons of k-means clustering?
Pros
- Relatively simple to implement
- Scales to large datasets
- Guaranteed to converge (to a local optimum)
Cons
- User has to define k
- Depends on initial values
- Cluster boundaries lie midway between centers
- Cannot model covariances well
What are the two types of hierarchical clustering? Define them
Agglomerative (bottom-up): start with each sample as its own cluster, then iteratively merge the two clusters with the smallest between-group dissimilarity
Divisive (top-down): start with one cluster containing all the data, then iteratively split the cluster with the largest between-group dissimilarity
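A minimal pure-numpy sketch of the agglomerative variant, assuming single linkage (minimum pairwise distance) as the between-group dissimilarity; the function name is illustrative:

```python
import numpy as np

def agglomerative(X, n_clusters):
    """Bottom-up single-linkage sketch: start with each sample as its
    own cluster, repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest single-linkage
        # (minimum pairwise) distance between them.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair into one cluster.
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for lab, members in enumerate(clusters):
        labels[members] = lab
    return labels
```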
Define mixture models
They represent p(x) as a weighted sum (mixture) of component distributions. Expectation-Maximisation (EM) can then be used to find (local) maximum-likelihood estimates of the parameters.
Write the formula for “responsibility” in the context of mixture models
For a mixture p(x) = sum_k pi_k N(x | mu_k, Sigma_k), the responsibility of component k for sample x_n is
r_nk = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
i.e. the posterior probability that component k generated x_n, computed in the E-step of EM.
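A numpy sketch of computing responsibilities in the E-step of a one-dimensional Gaussian mixture (the function name and the parameterisation by per-component weights, means, and standard deviations are illustrative assumptions):

```python
import numpy as np

def responsibilities(x, weights, means, stds):
    """E-step sketch for a 1-D Gaussian mixture: posterior probability
    of each component for each sample. Parameters are illustrative:
    weights (pi_k), means (mu_k), stds (sigma_k), one entry per component."""
    x = np.asarray(x, dtype=float)[:, None]  # shape (n, 1) for broadcasting
    # Gaussian densities N(x_n | mu_k, sigma_k) for every sample/component pair.
    dens = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    weighted = weights * dens                # pi_k * N(x_n | mu_k, sigma_k)
    # Normalise over components so each row sums to 1.
    return weighted / weighted.sum(axis=1, keepdims=True)
```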
What is the silhouette coefficient? What is it used for?
s_i = (b-a) / {max(a, b)}
a: mean distance between sample i and all other points in the same cluster
b: mean distance between sample i and all points in the next nearest cluster
Can be used to select the optimal number of clusters
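A pure-numpy sketch of the mean silhouette coefficient as defined above (the function name is illustrative; clusters are assumed to have at least two members):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient: s_i = (b - a) / max(a, b)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    # Full pairwise distance matrix.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        # a: mean distance to the other points in i's own cluster.
        a = D[i, same & (np.arange(len(X)) != i)].mean()
        # b: smallest mean distance to the points of any other cluster.
        b = min(D[i, labels == c].mean() for c in set(labels) - {labels[i]})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```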
What does the rand index measure?
The similarity between two cluster assignments: the fraction of sample pairs on which the two assignments agree.
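A minimal sketch of the (unadjusted) Rand index as pairwise agreement between two label vectors (the function name is illustrative):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index: fraction of sample pairs on which two clusterings agree
    (both place the pair in the same cluster, or both place it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

Note the index is invariant to label permutation: [0, 0, 1, 1] and [1, 1, 0, 0] describe the same partition and score 1.0.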