Clustering Flashcards

1
Q

Explain the steps taken in the K-means algorithm

A
  1. Randomly initialise k centroids.
  2. Compute the distance from each sample to every centroid.
  3. Assign each sample to its closest centroid.
  4. Recompute each centroid as the mean of its assigned samples.
  5. If not converged, go back to step 2.
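The steps above can be sketched in plain Python. This is a minimal illustration on 2-D points (the function and helper names are made up for this sketch), not a production implementation:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means on a list of tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    # 1. Randomly pick k samples as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # 2.-3. Assign each sample to its closest centroid.
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        # 4. Recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members)
                                           for c in zip(*members)))
            else:  # keep the old centroid if a cluster went empty
                new_centroids.append(centroids[j])
        # 5. Stop once the centroids no longer move; otherwise repeat.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)` separates the two left points from the two right points.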
2
Q

What are the pros and cons of k-means clustering?

A

Pros
- Relatively simple to implement
- Scales to large datasets
- Guaranteed to converge (possibly to a local optimum)

Cons
- The user has to choose k
- The result depends on the initial centroids
- Cluster boundaries are equidistant between centres, so clusters are implicitly spherical
- Cannot model covariances well

3
Q

What are the two types of hierarchical clustering? Define them

A

Agglomerative (bottom-up): start with each sample as its own cluster, then iteratively merge the two clusters with the smallest between-group dissimilarity

Divisive (top-down): start with one cluster containing all the data, then iteratively split the cluster with the largest between-group dissimilarity
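The agglomerative variant can be sketched in a few lines. This is a minimal single-linkage example on 1-D data (the function name is made up for this sketch; real implementations use a linkage matrix and are far more efficient):

```python
def agglomerative(points, n_clusters):
    """Bottom-up clustering: start with singletons, repeatedly merge
    the two clusters with the smallest single-linkage dissimilarity."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: smallest pairwise distance between groups.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters
```

For example, `agglomerative([0, 1, 9, 10, 20], 2)` merges the nearby points step by step, leaving `{0, 1, 9, 10}` and `{20}`.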

4
Q

Define mixture models

A

They represent p(x) as a weighted sum (mixture) of component distributions, p(x) = sum_k pi_k p_k(x), with mixing weights pi_k >= 0 that sum to 1. Expectation-Maximisation (EM) can then be used to find (local) maximum-likelihood estimates of the parameters.
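As a concrete sketch, assuming 1-D Gaussian components (a common special case; the helper names here are made up):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """p(x) = sum_k pi_k * N(x | mu_k, sigma_k); weights must sum to 1."""
    return sum(w * normal_pdf(x, m, s)
               for w, m, s in zip(weights, mus, sigmas))
```

With weights [0.5, 0.5], means [-1, 1] and unit variances, the mixture density is symmetric about 0.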

5
Q

Write the formula for “responsibility” in the context of mixture models

A

gamma_ik = pi_k N(x_i | mu_k, Sigma_k) / sum_j pi_j N(x_i | mu_j, Sigma_j)

i.e. the posterior probability that sample x_i was generated by component k. Responsibilities are computed in the E-step of EM.
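The formula can be checked with a short sketch, assuming 1-D Gaussian components (the function name is made up; the normalisation is just Bayes' rule):

```python
import math

def responsibilities(x, weights, mus, sigmas):
    """gamma_k = pi_k N(x|mu_k, sigma_k) / sum_j pi_j N(x|mu_j, sigma_j)."""
    def normal_pdf(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    # Numerator for each component: prior weight times likelihood.
    weighted = [w * normal_pdf(x, m, s)
                for w, m, s in zip(weights, mus, sigmas)]
    total = sum(weighted)  # denominator: the mixture density p(x)
    return [w / total for w in weighted]
```

The responsibilities always sum to 1, and the component whose mean is closest to x (all else equal) gets the larger share.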

6
Q

What is the silhouette coefficient? What is it used for?

A

s_i = (b - a) / max(a, b)
a: mean distance between sample i and all other points in the same cluster
b: mean distance between sample i and all points in the next nearest cluster

Values range from -1 to 1; averaging s_i over all samples gives a score that can be used to select the number of clusters.
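A minimal sketch of the per-sample coefficient on 1-D data (the function name is made up; it assumes each cluster has at least two points):

```python
def silhouette(point_idx, points, labels):
    """s_i = (b - a) / max(a, b) for one 1-D sample."""
    me, my_label = points[point_idx], labels[point_idx]
    def mean_dist(cluster):
        ds = [abs(me - p) for i, (p, lab) in enumerate(zip(points, labels))
              if lab == cluster and i != point_idx]
        return sum(ds) / len(ds)
    a = mean_dist(my_label)                       # within-cluster cohesion
    b = min(mean_dist(lab)                        # nearest other cluster
            for lab in set(labels) if lab != my_label)
    return (b - a) / max(a, b)
```

For points [0, 1, 10, 11] with labels [0, 0, 1, 1], sample 0 has a = 1 and b = 10.5, giving s ≈ 0.905 (well clustered).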

7
Q

What does the rand index measure?

A

The similarity between two cluster assignments: the fraction of sample pairs on which the two assignments agree (grouped together in both, or separated in both).
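This pair-counting definition is short enough to sketch directly (the function name is made up; this is the plain Rand index, not the adjusted variant):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree:
    same cluster in both, or different clusters in both."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)
```

Note that it is invariant to relabelling: [0, 0, 1, 1] and [1, 1, 0, 0] score 1.0.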
