06 Clustering Flashcards
1
Q
General objective of clustering
A
- Group a set of objects into clusters so that objects within the same cluster are similar to each other, and objects in different clusters are dissimilar.
- Unsupervised learning: no labeled training data; the structure is discovered from the data itself.
2
Q
Example applications of cluster analysis
A
- Visualization and exploratory data analysis
- Many applications within IR. Examples:
- Speed up search: First retrieve the most relevant cluster, then retrieve documents from within the cluster.
- Presenting the search results: Instead of ranked lists, organize the results as clusters.
- Dimensionality reduction / class-based features.
- News aggregation / topic directories.
- Social network analysis; identify sub-communities and user segments.
- Image segmentation, product recommendations, demographic analysis, etc.
3
Q
Main types of clustering methods
A
- Hierarchical
- Flat (partitional clustering)
4
Q
k-means
A
- Unsupervised variant of the Rocchio classifier.
- Goal: Partition the n observed objects into k clusters C so that each point x⃗_j belongs to the cluster c_i with the nearest centroid μ⃗_i.
- Typically uses Euclidean distance as the distance measure.
- The optimization problem: minimize the within-cluster sum of squares, WCSS = Σ_i Σ_{x⃗ ∈ c_i} ‖x⃗ − μ⃗_i‖².
- Equivalent to minimizing the average squared distance between objects and their cluster centroids (since n is fixed) – a measure of how well each centroid represents the members assigned to the cluster.
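The WCSS objective above can be sketched as a few lines of NumPy. The function name `wcss` and the array layout (rows of `X` as objects, an integer label per row) are illustrative assumptions, not from the source:

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares: the sum of squared Euclidean
    distances from each object to its assigned cluster centroid."""
    return sum(
        np.sum((X[labels == i] - mu) ** 2)
        for i, mu in enumerate(centroids)
    )

# Toy data: two tight clusters around (0, 1) and (10, 11)
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 11.0]])
print(wcss(X, labels, centroids))  # 4.0 (each point is distance 1 from its centroid)
```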
5
Q
k-means (algorithm and properties)
A
Algorithm
- Initialize: Compute centroids for k seeds.
- Iterate:
- Assign each object to the cluster with the nearest centroid.
- Compute new centroids for the clusters.
- Terminate: When stopping criterion is satisfied.
Properties
- In short, we iteratively reassign memberships and recompute centroids until the configuration stabilizes.
- WCSS is monotonically decreasing (or unchanged) for each iteration.
- Guaranteed to converge but not to find the global minimum.
- The time complexity is linear in the number of objects n: O(knm) per iteration for m-dimensional vectors.
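The algorithm above can be sketched as follows. This is a minimal illustration, assuming NumPy arrays and random objects as seeds; the function name `kmeans` and its parameters are my own:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means sketch: alternate assignment and centroid
    recomputation until cluster memberships stop changing."""
    rng = np.random.default_rng(seed)
    # Initialize: use k distinct objects as seed centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # clusters unchanged between iterations -> terminate
        labels = new_labels
        # Update step: recompute each centroid as its cluster's mean
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    return labels, centroids
```

Each iteration can only leave WCSS unchanged or decrease it, which is why convergence is guaranteed, but the result still depends on the random seeds.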
6
Q
k-means: Possible termination criteria
A
- Fixed number of iterations
- Clusters or centroids are unchanged between iterations
- Threshold on the decrease of the objective function (absolute or relative to previous iteration)
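The threshold criterion in the last bullet can be sketched as a simple check between consecutive iterations. The helper name `should_stop` and the tolerance value are illustrative assumptions:

```python
def should_stop(prev_wcss, curr_wcss, rel_tol=1e-4):
    """Terminate when the decrease in WCSS, relative to the
    previous iteration's value, falls below rel_tol."""
    return (prev_wcss - curr_wcss) <= rel_tol * prev_wcss

print(should_stop(100.0, 99.999))  # True: relative decrease 1e-5 is below 1e-4
print(should_stop(100.0, 90.0))   # False: still improving substantially
```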
7
Q
Some close relatives of k-means
A
- k-medoids: Like k-means but uses medoids instead of centroids to represent the cluster centers
- Fuzzy c-means (FCM): Like k-means but assigns soft memberships in [0, 1], where membership is a function of the centroid distance.
- The computations of both WCSS and centroids are weighted by the membership function.
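A sketch of how FCM turns centroid distances into soft memberships, assuming the standard fuzzifier parameter m (m = 2 here); the function name and array layout are illustrative, not from the source:

```python
import numpy as np

def fcm_memberships(dists, m=2.0):
    """Soft memberships in [0, 1] for one point, as a function of its
    distances to each centroid: closer centroids get higher weight,
    and the memberships sum to 1."""
    inv = dists ** (-2.0 / (m - 1))
    return inv / inv.sum()

d = np.array([1.0, 3.0])  # the point is 3x closer to centroid 0
print(fcm_memberships(d))  # [0.9 0.1]
```

These membership weights are then used to compute both the (weighted) WCSS and the (weighted) centroids.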
8
Q
Flat Clustering: The good and the bad
A
+ Conceptually simple, and easy to implement
+ Efficient. Typically linear in the number of objects.
- The dependence on random seeds as in k-means makes the clustering non-deterministic.
- The number of clusters k must be pre-specified. Often no principled means of a priori specifying k.
- The clustering quality is often considered inferior to that of the less efficient hierarchical methods.
- Not as informative as the more structured clusterings produced by hierarchical methods.
9
Q
k-means clustering can be thought of as performing _____ in each iteration.
A
k-means clustering can be thought of as performing Rocchio classification in each iteration.