4. Clustering Flashcards
What is the goal of clustering?
When given a set of points with a notion of distance, group the points into a number of clusters so that members of a cluster are similar to each other and members of different clusters are dissimilar.
Why is clustering a hard problem?
We are often working in very high-dimensional space, where almost all pairs of points are at about the same distance.
There is also typically a large amount of data to process.
What are the 2 methods for clustering?
- Hierarchical (agglomerative and divisive)
- Point assignment where points belong to their nearest cluster
What is the difference between agglomerative and divisive hierarchical clustering?
Agglomerative is bottom-up. Each point starts as a cluster and the two nearest clusters are repeatedly combined
Divisive is top-down. Each point starts in the same cluster. We split the cluster repeatedly until we get the desired number of clusters
What are the important questions associated with hierarchical clustering?
- How do you represent a cluster of more than 1 point?
- How do you determine the nearness of clusters?
- When do you stop combining clusters?
How does the Euclidean space model answer the questions about hierarchical clustering?
- Each cluster is represented by its centroid
- Cluster distances are measured by distance between centroids
- Combine until we get the desired number of clusters
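The Euclidean recipe above can be sketched directly: every point starts as its own cluster, and we repeatedly merge the two clusters whose centroids are nearest. This is a minimal illustrative sketch (the function names are my own, and the naive all-pairs search is the O(n^3)-flavored approach discussed below, not an optimized implementation):

```python
import math

def centroid(points):
    """Average of the points in a cluster (an artificial point)."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, k):
    """Bottom-up merging: start with one cluster per point, repeatedly
    merge the two clusters with the nearest centroids, stop at k clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Naive search for the pair of clusters with the closest centroids.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Example: two obvious groups on a line merge as expected.
clusters = agglomerative([(0.0,), (1.0,), (10.0,), (11.0,)], 2)
```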
How does the non-Euclidean space model answer the questions about hierarchical clustering?
- Each cluster is represented by a clustroid, which is the data point closest to the other points in the cluster
- Cluster distances are measured by distance between clustroids
- Combine until we get the desired number of clusters
What is the difference between a centroid and clustroid?
A centroid is the average of the data points in the cluster. It is an artificial point.
A clustroid is an existing data point that is closest to the other points in the cluster.
How do we determine which point becomes the clustroid?
- Smallest maximum distance to other points
- Smallest average distance to other points
- Smallest sum of squares of distances to other points
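A minimal sketch of clustroid selection under these three criteria, assuming points are tuples (the `score` parameter names are my own):

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def clustroid(points, score="sumsq"):
    """Pick the existing point that minimizes the chosen score over its
    distances to the other points in the cluster."""
    def cost(p):
        ds = [dist(p, q) for q in points if q != p]
        if score == "max":
            return max(ds)                 # smallest maximum distance
        if score == "avg":
            return sum(ds) / len(ds)       # smallest average distance
        return sum(d * d for d in ds)      # smallest sum of squared distances

    return min(points, key=cost)

# The central point wins; the outlier (1.0, 5.0) does not.
pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.1), (1.0, 5.0)]
best = clustroid(pts)
```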
What are the 2 approaches for determining how many clusters to create?
- Pick a number k up front and stop when you have k clusters
- Stop when the next merge would create a cluster with low cohesion
What are the 3 methods for determining cohesion of a cluster?
- Diameter of the merged cluster which is max distance between points in the cluster
- Radius which is the maximum distance of a point from the centroid or clustroid
- Use a density-based approach: take the diameter or average distance and divide by the number of points in the cluster
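The three cohesion measures can be written out as short helpers; this is an illustrative sketch assuming Euclidean points as tuples:

```python
import math
from itertools import combinations

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def diameter(points):
    """Max distance between any two points in the cluster."""
    return max(dist(p, q) for p, q in combinations(points, 2))

def radius(points):
    """Max distance of a point from the centroid."""
    n = len(points)
    c = tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))
    return max(dist(p, c) for p in points)

def density(points):
    """Density-style score: diameter divided by the number of points."""
    return diameter(points) / len(points)

# 3-4-5 right triangle: diameter is the hypotenuse, length 5.
tri = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
d = diameter(tri)
```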
What is the issue with hierarchical clustering?
It is slow. It is O(n^3) for the naive approach and O(n^2 log n) for careful implementation
How does k-means clustering work?
- Pick k initial centroids and assign each point to the cluster with the nearest centroid
- Update the locations of the centroids to the mean of their assigned points
- Reassign all points to their closest centroid
- Repeat until points stop moving between clusters and the centroids stabilize
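The loop above (Lloyd's iteration) can be sketched in a few lines; this assumes initial centroids are supplied and points are tuples:

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, centroids, max_iter=100):
    """Assign each point to its nearest centroid, recompute centroids as
    cluster means, and repeat until assignments stop changing."""
    assignment = None
    for _ in range(max_iter):
        new = [min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
               for p in points]
        if new == assignment:  # no point changed cluster: converged
            break
        assignment = new
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # leave an empty cluster's centroid in place
                centroids[i] = tuple(sum(m[d] for m in members) / len(members)
                                     for d in range(len(members[0])))
    return assignment, centroids

pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels, cents = kmeans(pts, [(0.0,), (10.0,)])
```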
How do we select k for k-means clustering?
Look at the average distance to the centroid as k increases. It falls rapidly until the correct k and then changes little.
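One way to mechanize "falls rapidly, then changes little" is to scan the cost curve for the first k where the relative drop becomes small; the threshold here is an arbitrary illustrative choice, not part of the source material:

```python
def pick_k(avg_dists, threshold=0.1):
    """Given average distance-to-centroid for k = 1, 2, ..., return the
    first k after which the curve stops falling rapidly (relative drop
    below `threshold`)."""
    for k in range(1, len(avg_dists)):
        drop = (avg_dists[k - 1] - avg_dists[k]) / avg_dists[k - 1]
        if drop < threshold:
            return k  # the elbow: k clusters were enough
    return len(avg_dists)

# Cost falls fast until k = 3, then flattens out.
best_k = pick_k([10.0, 6.0, 3.0, 2.9, 2.85])
```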
What are the 2 approaches for picking the initial k points for clusters?
- Sampling: Use hierarchical clustering to get k clusters, pick a point from each, and use them as centroids
- Dispersed set: Pick a random point, then pick k-1 more points, each as far as possible from the points already selected
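The dispersed-set approach (farthest-point seeding) can be sketched as follows; a greedy rule of maximizing each candidate's minimum distance to the already-chosen points is one common reading of "as far as possible" (the function name is my own):

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dispersed_init(points, k, first=0):
    """Start from the point at index `first`, then repeatedly add the point
    whose minimum distance to the already-chosen seeds is largest."""
    chosen = [points[first]]
    while len(chosen) < k:
        farthest = max((p for p in points if p not in chosen),
                       key=lambda p: min(dist(p, c) for c in chosen))
        chosen.append(farthest)
    return chosen

# The far corner is picked second, then the remaining distant point.
pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
seeds = dispersed_init(pts, 3)
```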