06 Clustering Flashcards

1
Q

General objective of clustering

A
  • Group a set of objects into clusters such that objects within the same cluster are similar to each other, while objects in different clusters are dissimilar.
2
Q

Example applications of cluster analysis

A
  • Visualization and exploratory data analysis
  • Many applications within IR. Examples:
    • Speed up search: First retrieve the most relevant cluster, then retrieve documents from within the cluster.
    • Presenting the search results: Instead of ranked lists, organize the results as clusters.
  • Dimensionality reduction / class-based features.
  • News aggregation / topic directories.
  • Social network analysis; identify sub-communities and user segments.
  • Image segmentation, product recommendations, demographic analysis, etc.
3
Q

Main types of clustering methods

A
  • Hierarchical
  • Flat (partitional clustering)
4
Q

k-means

A
  • Unsupervised variant of the Rocchio classifier.
  • Goal: Partition the n observed objects into k clusters C so that each object x_j belongs to the cluster c_i with the nearest centroid μ_i.
  • Typically assumes Euclidean distance as the similarity function s.
  • The optimization problem: Minimize the within-cluster sum of squares summed over all clusters, F_s = WCSS = Σ_{i=1..k} Σ_{x_j ∈ c_i} ||x_j − μ_i||².
  • Equivalent to minimizing the average squared distance between objects and their cluster centroids (since n is fixed) – a measure of how well each centroid represents the members assigned to the cluster.
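
A minimal NumPy sketch of this objective (the function name wcss and the array layout are illustrative assumptions, not prescribed by the cards): given the data matrix, the current cluster assignments, and the centroids, WCSS is the total squared Euclidean distance from each object to its assigned centroid.

  import numpy as np

  def wcss(X, assignments, centroids):
      # X: (n, d) data matrix, assignments: (n,) cluster indices, centroids: (k, d).
      # Total squared Euclidean distance from each object to its cluster centroid.
      diffs = X - centroids[assignments]
      return float(np.sum(diffs ** 2))
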
5
Q

k-means (algorithm and properties)

A

Algorithm

  • Initialize: Compute centroids for k seeds.
  • Iterate:
    • Assign each object to the cluster with the nearest centroid.
    • Compute new centroids for the clusters.
  • Terminate: When stopping criterion is satisfied.

Properties

  • In short, we iteratively reassign memberships and recompute centroids until the configuration stabilizes.
  • WCSS is monotonically non-increasing from one iteration to the next.
  • Guaranteed to converge, but not necessarily to the global minimum.
  • The time complexity is linear in the number of objects: O(kn) distance computations per iteration.
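
A minimal sketch of the algorithm described above, assuming NumPy, Euclidean distance, random objects as seeds, and "centroids unchanged" as the stopping criterion (function and parameter names are illustrative, not from the cards):

  import numpy as np

  def k_means(X, k, max_iter=100, seed=0):
      rng = np.random.default_rng(seed)
      # Initialize: use k randomly chosen objects as the initial centroids.
      centroids = X[rng.choice(len(X), size=k, replace=False)]
      for _ in range(max_iter):
          # Assign each object to the cluster with the nearest centroid.
          dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
          assignments = dists.argmin(axis=1)
          # Recompute each centroid as the mean of its assigned objects
          # (an empty cluster keeps its previous centroid).
          new_centroids = np.array([
              X[assignments == i].mean(axis=0) if np.any(assignments == i) else centroids[i]
              for i in range(k)
          ])
          # Terminate: stop when the centroids no longer change.
          if np.allclose(new_centroids, centroids):
              break
          centroids = new_centroids
      return assignments, centroids

Because the result depends on the random seeds, different seed values can give different clusterings of the same data.
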
6
Q

k-means: Possible termination criteria

A
  • Fixed number of iterations
  • Clusters or centroids are unchanged between iterations
  • Threshold on the decrease of the objective function (absolute or relative to previous iteration)
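
As an illustration of the last criterion, one might stop when the relative decrease of WCSS between iterations falls below some tolerance (the function name and the default tolerance are illustrative assumptions):

  def should_stop(prev_wcss, curr_wcss, rel_tol=1e-4):
      # Stop when the relative improvement of the objective is below the threshold.
      if prev_wcss is None:      # first iteration: nothing to compare against yet
          return False
      if prev_wcss == 0.0:       # objective already at its minimum
          return True
      return (prev_wcss - curr_wcss) / prev_wcss < rel_tol
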
7
Q

Some close relatives of k-means

A
  • k-medoids: Like k-means, but uses medoids instead of centroids to represent the cluster centers.
  • Fuzzy c-means (FCM): Like k-means, but assigns soft memberships in [0, 1], where membership is a function of the distance to the centroid.
    • The computations of both WCSS and the centroids are weighted by the membership function.
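
A sketch of the standard FCM membership update for fuzzifier m > 1 (the function name, the epsilon guard against zero distances, and the default m = 2 are illustrative assumptions, not taken from the cards):

  import numpy as np

  def fcm_memberships(X, centroids, m=2.0, eps=1e-9):
      # Distances from every object (rows) to every centroid (columns).
      d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + eps
      # u[j, i] = 1 / sum_l (d[j, i] / d[j, l]) ** (2 / (m - 1));
      # each object's memberships sum to 1 across the clusters.
      ratios = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
      return 1.0 / ratios.sum(axis=2)
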
8
Q

Flat Clustering: The good and the bad

A

+ Conceptually simple and easy to implement.
+ Efficient: typically linear in the number of objects.

− The dependence on random seeds, as in k-means, makes the clustering non-deterministic.
− The number of clusters k must be pre-specified; often there is no principled means of specifying k a priori.
− The clustering quality is often considered inferior to that of the less efficient hierarchical methods.
− Not as informative as the more structured clusterings produced by hierarchical methods.
9
Q

k-means clustering can be thought of as performing _____ in each iteration.

A

k-means clustering can be thought of as performing Rocchio classification in each iteration.
