06 Clustering Flashcards
1
Q
General objective of clustering
A
- Group a set of objects into clusters so that objects within the same cluster are similar to each other, and objects in different clusters are dissimilar.
- Unsupervised learning: no labeled training data; the structure is discovered from the data itself.
2
Q
Example applications of cluster analysis
A
- Visualization and exploratory data analysis
- Many applications within IR. Examples:
- Speed up search: First retrieve the most relevant cluster, then retrieve documents from within the cluster.
- Presenting the search results: Instead of ranked lists, organize the results as clusters.
- Dimensionality reduction / class-based features.
- News aggregation / topic directories.
- Social network analysis; identify sub-communities and user segments.
- Image segmentation, product recommendations, demographic analysis, etc.
3
Q
Main types of clustering methods
A
- Hierarchical
- Flat (partitional clustering)
4
Q
k-means
A
- Unsupervised variant of the Rocchio classifier.
- Goal: Partition the n observed objects into k clusters C so that each point x⃗_j belongs to the cluster c_i with the nearest centroid μ⃗_i.
- Typically uses Euclidean distance as the distance measure.
- The optimization problem: minimize the within-cluster sum of squares, WCSS = Σ_i Σ_{x⃗ ∈ c_i} ‖x⃗ − μ⃗_i‖².
- Equivalent to minimizing the average squared distance between objects and their cluster centroids (since n is fixed) – a measure of how well each centroid represents the members assigned to the cluster.
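The WCSS objective above can be sketched as a few lines of NumPy. The function name `wcss` and the array layout (rows of `X` as objects, an integer label per row) are illustrative assumptions, not from the source:

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares: the sum of squared Euclidean
    distances from each object to its assigned cluster centroid."""
    return sum(
        np.sum((X[labels == i] - mu) ** 2)
        for i, mu in enumerate(centroids)
    )

# Toy data: two tight clusters around (0, 1) and (10, 11)
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 11.0]])
print(wcss(X, labels, centroids))  # 4.0 (each point is distance 1 from its centroid)
```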
5
Q
k-means (algorithm and properties)
A
Algorithm
- Initialize: Compute centroids for k seeds.
- Iterate:
- Assign each object to the cluster with the nearest centroid.
- Compute new centroids for the clusters.
- Terminate: When stopping criterion is satisfied.
Properties
- In short, we iteratively reassign memberships and recompute centroids until the configuration stabilizes.
- WCSS is monotonically decreasing (or unchanged) for each iteration.
- Guaranteed to converge but not to find the global minimum.
- The time complexity is linear in the number of objects n: O(knm) per iteration for m-dimensional vectors.
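The algorithm above can be sketched as follows. This is a minimal illustration, assuming NumPy arrays and random objects as seeds; the function name `kmeans` and its parameters are my own:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means sketch: alternate assignment and centroid
    recomputation until cluster memberships stop changing."""
    rng = np.random.default_rng(seed)
    # Initialize: use k distinct objects as seed centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # clusters unchanged between iterations -> terminate
        labels = new_labels
        # Update step: recompute each centroid as its cluster's mean
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    return labels, centroids
```

Each iteration can only leave WCSS unchanged or decrease it, which is why convergence is guaranteed, but the result still depends on the random seeds.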
6
Q
k-means: Possible termination criteria
A
- Fixed number of iterations
- Clusters or centroids are unchanged between iterations
- Threshold on the decrease of the objective function (absolute or relative to previous iteration)
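The threshold criterion in the last bullet can be sketched as a simple check between consecutive iterations. The helper name `should_stop` and the tolerance value are illustrative assumptions:

```python
def should_stop(prev_wcss, curr_wcss, rel_tol=1e-4):
    """Terminate when the decrease in WCSS, relative to the
    previous iteration's value, falls below rel_tol."""
    return (prev_wcss - curr_wcss) <= rel_tol * prev_wcss

print(should_stop(100.0, 99.999))  # True: relative decrease 1e-5 is below 1e-4
print(should_stop(100.0, 90.0))   # False: still improving substantially
```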
7
Q
Some close relatives of k-means
A
- k-medoids: Like k-means but uses medoids instead of centroids to represent the cluster centers
- Fuzzy c-means (FCM): Like k-means but assigns soft memberships in [0, 1], where membership is a function of the centroid distance.
- The computations of both WCSS and centroids are weighted by the membership function.
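A sketch of how FCM turns centroid distances into soft memberships, assuming the standard fuzzifier parameter m (m = 2 here); the function name and array layout are illustrative, not from the source:

```python
import numpy as np

def fcm_memberships(dists, m=2.0):
    """Soft memberships in [0, 1] for one point, as a function of its
    distances to each centroid: closer centroids get higher weight,
    and the memberships sum to 1."""
    inv = dists ** (-2.0 / (m - 1))
    return inv / inv.sum()

d = np.array([1.0, 3.0])  # the point is 3x closer to centroid 0
print(fcm_memberships(d))  # [0.9 0.1]
```

These membership weights are then used to compute both the (weighted) WCSS and the (weighted) centroids.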
8
Q
Flat Clustering: The good and the bad
A
+ Conceptually simple, and easy to implement
+ Efficient. Typically linear in the number of objects.
- The dependence on random seeds as in k-means makes the clustering non-deterministic.
- The number of clusters k must be pre-specified. Often no principled means of a priori specifying k.
- The clustering quality is often considered inferior to that of the less efficient hierarchical methods.
- Not as informative as the more structured clusterings produced by hierarchical methods.
9
Q
k-means clustering can be thought of as performing _____ in each iteration.
A
k-means clustering can be thought of as performing Rocchio classification in each iteration.