4. Clustering Flashcards

1
Q

What is the goal of clustering?

A

Given a set of points with a notion of distance between them, group the points into some number of clusters so that members of the same cluster are similar (close) to each other and members of different clusters are dissimilar (far apart).

2
Q

Why is clustering a hard problem?

A

We are often working in very high-dimensional spaces, where almost all pairs of points are at about the same distance.
There is also usually a very large amount of data.

3
Q

What are the 2 methods for clustering?

A
  1. Hierarchical (agglomerative and divisive)
  2. Point assignment, where each point is assigned to its nearest cluster
4
Q

What is the difference between agglomerative and divisive hierarchical clustering?

A

Agglomerative is bottom-up: each point starts as its own cluster, and the two nearest clusters are repeatedly combined.
Divisive is top-down: all points start in one cluster, which is split repeatedly until we get the desired number of clusters.
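
A naive bottom-up sketch in Python (illustrative only; it assumes Euclidean points as rows of a NumPy array, and merging by nearest centroids anticipates the Euclidean answer two cards down):

```python
import numpy as np

def agglomerative(points, k):
    """Every point starts as its own cluster; repeatedly merge the two
    clusters whose centroids are nearest until only k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        centroids = [np.mean(c, axis=0) for c in clusters]
        best_pair, best_d = None, np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(centroids[i] - centroids[j])
                if d < best_d:
                    best_pair, best_d = (i, j), d
        i, j = best_pair
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]
    return clusters
```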

5
Q

What are the important questions associated with hierarchical clustering?

A
  1. How do you represent a cluster of more than 1 point?
  2. How do you determine the nearness of clusters?
  3. When do you stop combining clusters?
6
Q

How does the Euclidean space model answer the questions about hierarchical clustering?

A
  1. Each cluster is represented by its centroid
  2. Cluster distances are measured by distance between centroids
  3. Combine until we get the desired number of clusters
7
Q

How does the non-Euclidean space model answer the questions about hierarchical clustering?

A
  1. Each cluster is represented by a clustroid, which is the data point closest to the other points in the cluster
  2. Cluster distances are measured by distance between clustroids
  3. Combine until we get the desired number of clusters
8
Q

What is the difference between a centroid and clustroid?

A

A centroid is the average of the data points in the cluster. It is an artificial point.
A clustroid is an existing data point that is closest to the other points in the cluster.

9
Q

How do we determine which point becomes the clustroid?

A
  1. Smallest maximum distance to other points
  2. Smallest average distance to other points
  3. Smallest sum of squares of distances to other points
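
A quick sketch of the three criteria in Python (illustrative names, assuming the cluster is an n x d NumPy array of points):

```python
import numpy as np

def clustroid(points, criterion="sum_sq"):
    """Pick the clustroid of a cluster under one of the three criteria.
    (Function and argument names are illustrative, not from a library.)"""
    # Pairwise Euclidean distance matrix (n x n).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    if criterion == "max":        # 1. smallest maximum distance
        scores = dists.max(axis=1)
    elif criterion == "avg":      # 2. smallest average distance
        scores = dists.mean(axis=1)
    else:                         # 3. smallest sum of squared distances
        scores = (dists ** 2).sum(axis=1)

    # The clustroid is an actual data point, not an average.
    return points[np.argmin(scores)]
```
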
10
Q

What are the 2 approaches for determining how many clusters to create?

A
  1. Pick a number k up front and stop when you have k clusters
  2. Stop when the next merge would create a cluster with low cohesion
11
Q

What are the 3 methods for determining cohesion of a cluster?

A
  1. Diameter of the merged cluster, which is the maximum distance between points in the cluster
  2. Radius, which is the maximum distance of a point from the centroid or clustroid
  3. A density-based approach: take the diameter or average distance and divide by the number of points in the cluster
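
A sketch of the three measures (same assumptions as before: Euclidean points in a NumPy array; `centroid` could equally be a clustroid in the non-Euclidean case):

```python
import numpy as np

def cohesion(points, centroid, measure="diameter"):
    """Illustrative cohesion measures for one cluster (n x d array)."""
    diffs = points[:, None, :] - points[None, :, :]
    pairwise = np.sqrt((diffs ** 2).sum(axis=2))

    if measure == "diameter":  # 1. max distance between any two points
        return pairwise.max()
    if measure == "radius":    # 2. max distance from the centroid/clustroid
        return np.sqrt(((points - centroid) ** 2).sum(axis=1)).max()
    # 3. density-based: diameter divided by the number of points
    return pairwise.max() / len(points)
```
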
12
Q

What is the issue with hierarchical clustering?

A

It is slow: O(n^3) for the naive approach and O(n^2 log n) for a careful implementation.

13
Q

How does the k-means clustering work?

A
  1. Assign each point to the cluster whose centroid is nearest
  2. Update the locations of the centroids
  3. Reassign all points to their closest centroid
  4. Repeat until points stop moving between clusters and the centroids stabilize
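
A minimal k-means sketch (the random initialization here is just for brevity; the next cards cover better ways to pick k and the initial points):

```python
import numpy as np

def k_means(points, k, n_iters=100, seed=0):
    """Assign, update, repeat until stable. Empty clusters are not
    handled in this sketch."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iters):
        # 1/3. Assign each point to the nearest centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 2. Update each centroid to the mean of its assigned points.
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        # 4. Stop once the centroids stabilize.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```
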
14
Q

How do we select k for k-means clustering?

A

Look at the average distance to centroid as k increases. It falls rapidly until the right k, then changes little (the "elbow").
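
A sketch of the elbow check, assuming the `k_means` sketch above:

```python
import numpy as np

def avg_dist_to_centroid(points, labels, centroids):
    """Average distance from each point to its assigned centroid."""
    return np.sqrt(((points - centroids[labels]) ** 2).sum(axis=1)).mean()

# Try increasing k and look for the elbow where the curve stops
# falling rapidly:
# for k in range(1, 10):
#     labels, centroids = k_means(points, k)
#     print(k, avg_dist_to_centroid(points, labels, centroids))
```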

15
Q

What are the 2 approaches for picking the initial k points for clusters?

A
  1. Sampling: Use hierarchical clustering to get k clusters, pick a point from each, and use them as centroids
  2. Dispersed set: Pick a random point, then pick k-1 more points, each as far as possible from the points already selected
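
A sketch of the dispersed-set (farthest-first) initialization; names are illustrative:

```python
import numpy as np

def dispersed_init(points, k, seed=0):
    """Pick a random starting point, then repeatedly add the point whose
    distance to the nearest already-chosen point is largest."""
    rng = np.random.default_rng(seed)
    chosen = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # Distance from every point to its nearest chosen point.
        d = np.min([np.sqrt(((points - c) ** 2).sum(axis=1)) for c in chosen],
                   axis=0)
        chosen.append(points[np.argmax(d)])
    return np.array(chosen)
```
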
16
Q

What is the BFR algorithm?

A

A variant of k-means clustering to handle very large data sets. Meant to reduce memory usage from O(data) to O(clusters)

17
Q

What is the key assumption of the BFR algorithm?

A

Assumes clusters are normally distributed around a centroid in Euclidean space. Results in axis-aligned ellipses as clusters

18
Q

What are the 3 classes of points in the BFR algorithm?

A
  1. Discard set: Points close enough to a centroid to be summarized
  2. Compression set: Points that are close together but not close to a centroid. They are summarized but not added to a cluster
  3. Retained set: Isolated points waiting to be assigned to a compression set
19
Q

How is each cluster summarized using the BFR algorithm?

A

The discard set of a cluster is summarized by: the number of points N; the vector SUM, whose ith component is the sum of the ith coordinates of all the points; and the vector SUMSQ, whose ith component is the sum of the squares of the ith coordinates.

20
Q

How do we perform the BFR algorithm?

A

Read points into memory from disk one main-memory-full at a time
Points from previous memory loads are summarized by the sample statistics for the 3 point sets

21
Q

How can we calculate the average in each dimension using BFR?

A

For the ith dimension, compute SUMi / N

22
Q

How can we calculate variance of a cluster’s discard set using BFR?

A

For the ith dimension, compute SUMSQi/N - (SUMi/N)^2
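
A sketch tying the last few cards together: the (N, SUM, SUMSQ) summary and the mean and variance derived from it (class and method names are illustrative):

```python
import numpy as np

class ClusterSummary:
    """BFR-style summary of a cluster's discard set."""
    def __init__(self, dim):
        self.N = 0
        self.SUM = np.zeros(dim)
        self.SUMSQ = np.zeros(dim)

    def add(self, point):
        self.N += 1
        self.SUM += point
        self.SUMSQ += point ** 2

    def mean(self):      # SUMi / N in each dimension
        return self.SUM / self.N

    def variance(self):  # SUMSQi/N - (SUMi/N)^2 in each dimension
        return self.SUMSQ / self.N - (self.SUM / self.N) ** 2
```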

23
Q

How do we decide whether to put a new point into a cluster in BFR?

A

Add it if there is a high likelihood that the point belongs to the currently nearest centroid, according to the normal-distribution assumption
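
The card leaves "high likelihood" abstract; one common way to make it concrete (e.g., in the Mining of Massive Datasets treatment of BFR) is a Mahalanobis-style distance that normalizes each dimension by its standard deviation from the summary. A sketch assuming the `ClusterSummary` above and a tunable threshold:

```python
import numpy as np

def mahalanobis(point, summary):
    """Distance to the centroid with each dimension scaled by its standard
    deviation (assumes every per-dimension variance is positive)."""
    std = np.sqrt(summary.variance())
    return np.sqrt((((point - summary.mean()) / std) ** 2).sum())

# Assumed rule: assign the point to the nearest cluster if
# mahalanobis(point, summary) is below a chosen threshold
# (e.g., a few standard deviations).
```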

24
Q

How do we determine if 2 compression set subclusters should be combined?

A

Compute the variance of the proposed combined cluster. Combine the subclusters if the variance is below some threshold
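
Because N, SUM, and SUMSQ are additive, the combined variance can be computed without revisiting the points. A sketch, again assuming the `ClusterSummary` above:

```python
def combined_variance(s1, s2):
    """Per-dimension variance of the hypothetical merged cluster, straight
    from the two (N, SUM, SUMSQ) summaries, which are additive."""
    n = s1.N + s2.N
    total = s1.SUM + s2.SUM
    total_sq = s1.SUMSQ + s2.SUMSQ
    return total_sq / n - (total / n) ** 2

# Assumed rule: merge the two subclusters if combined_variance(s1, s2)
# stays below some threshold in every dimension.
```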

25
Q

What is the issue with the BFR algorithm?

A

Its normal-distribution assumption means clusters must be axis-aligned ellipses; it cannot handle clusters oriented at an angle to the axes

26
Q

What is the CURE algorithm?

A

Clustering Using REpresentatives.
It assumes Euclidean distance, allows clusters to be any shape, and uses a collection of representative points to represent clusters

27
Q

What happens during the 1st pass over the data of the CURE algorithm?

A
  1. Pick a random sample of points that fit in main memory
  2. Cluster initial points hierarchically
  3. Pick representative points for each cluster, as dispersed as possible, then move each some fixed distance toward the cluster's centroid (see the sketch below)
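
A sketch of step 3 (the shrink fraction is a tunable assumption, often around 20%):

```python
import numpy as np

def cure_representatives(points, n_rep=4, shrink=0.2):
    """Pick n_rep points of one cluster as dispersed as possible, then move
    each a fraction of the way toward the cluster's centroid."""
    centroid = points.mean(axis=0)
    # Farthest-first: start from the point farthest from the centroid.
    reps = [points[np.argmax(np.sqrt(((points - centroid) ** 2).sum(axis=1)))]]
    while len(reps) < n_rep:
        d = np.min([np.sqrt(((points - r) ** 2).sum(axis=1)) for r in reps],
                   axis=0)
        reps.append(points[np.argmax(d)])
    # Shrink each representative toward the centroid.
    return np.array([r + shrink * (centroid - r) for r in reps])
```
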
28
Q

What happens during the 2nd pass over the data of the CURE algorithm?

A
  1. Rescan the dataset
  2. Place each point in the closest cluster by finding the closest representative