Clustering 8 Flashcards
What is clustering?
• Clustering: the process of grouping a set of objects into clusters of similar objects
• Documents within a cluster should be similar.
• Documents from different clusters should be dissimilar.
• Clustering is the most common form of unsupervised learning.
Benefits of Clustering?
• Whole corpus analysis/navigation
• Better user interface: search without typing
• For improving performance in search applications
• Cluster hypothesis - Documents in the same cluster behave similarly with
respect to relevance to information needs
• Therefore, to improve searching performance:
• Cluster the docs in the corpus a priori
• When a query matches a doc D, also return the other docs in the cluster
containing D (see the sketch below)
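A minimal sketch of this cluster-expansion step, assuming a precomputed doc-to-cluster mapping; all names and data here are hypothetical:

def expand_with_cluster(matched_doc, doc_to_cluster, cluster_to_docs):
    """Return the matched doc plus the other docs in its cluster."""
    cluster_id = doc_to_cluster[matched_doc]
    return [matched_doc] + [d for d in cluster_to_docs[cluster_id]
                            if d != matched_doc]

# Query matches "d2"; its cluster also contains "d5" and "d9".
doc_to_cluster = {"d1": 0, "d2": 1, "d5": 1, "d9": 1}
cluster_to_docs = {0: ["d1"], 1: ["d2", "d5", "d9"]}
print(expand_with_cluster("d2", doc_to_cluster, cluster_to_docs))
# ['d2', 'd5', 'd9']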
What is hard clustering?
Hard clustering: Each document belongs to exactly one cluster
• More common and easier to do
What is hierarchical clustering?
Hierarchical clustering builds a hierarchy of clusters, represented as a
dendrogram. A flat clustering is obtained by cutting the dendrogram at a
desired level: each connected component then forms a cluster.
Types of Hierarchical Clustering?
• Divisive (top down) clustering: starts with all data points in one
cluster, the root, then
• Splits the root into a set of child clusters. Each child cluster is recursively
divided further
• Stops when only singleton clusters remain, i.e., each cluster contains a
single data point
• Agglomerative (bottom up) clustering: the dendrogram is built from
the bottom level by
• merging the most similar (or nearest) pair of clusters
• stopping when all the data points have been merged into a single cluster
(i.e., the root cluster); a SciPy sketch follows
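A short sketch of agglomerative clustering with SciPy on toy 2-D points (assumed data): linkage builds the dendrogram bottom-up by merging the nearest pair of clusters, and fcluster cuts it at a distance threshold to yield a flat clustering, as described above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])

# Build the dendrogram bottom-up, merging the nearest pair at each step.
Z = linkage(points, method="single")

# Cut the dendrogram at distance 1.0: each connected component below
# the cut becomes one cluster.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)  # e.g. [1 1 2 2]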
How do you split a divisive cluster?
• At each step, any flat/partitioning clustering algorithm can be used, e.g.,
K-means with K = 2
Variants of defining closest pair of clusters?
- Single-linkage
- Similarity of the closest pair: the most cosine-similar pair of points
- Complete-linkage
- Similarity of the “furthest” pair: the least cosine-similar pair of points
- Centroid-linkage
- Similarity of the clusters’ centroids (centers of gravity)
- Average-linkage
- Average cosine similarity between all pairs of elements
(All four variants are sketched below.)
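A sketch of the four variants as cosine similarities between two clusters A and B (rows are document vectors; the data is illustrative), using plain NumPy:

import numpy as np

def cosine_matrix(A, B):
    # Pairwise cosine similarities between the rows of A and B.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def linkage_similarities(A, B):
    S = cosine_matrix(A, B)
    centroid_sim = cosine_matrix(A.mean(axis=0, keepdims=True),
                                 B.mean(axis=0, keepdims=True))[0, 0]
    return {
        "single":   S.max(),       # most cosine-similar pair
        "complete": S.min(),       # least cosine-similar ("furthest") pair
        "centroid": centroid_sim,  # similarity of the centroids
        "average":  S.mean(),      # average cosine over all pairs
    }

A = np.array([[1.0, 0.0], [0.9, 0.1]])
B = np.array([[0.0, 1.0], [0.2, 0.8]])
print(linkage_similarities(A, B))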
How do you know a method of clustering is good?
A good clustering method produces high-quality clusters in which:
• The intra-cluster similarity is high
• The inter-cluster similarity is low
External criteria for clustering quality
• Assesses a clustering with respect to labeled data
• Assume the documents come with C gold-standard classes, while our clustering
algorithm produces K clusters, c1, c2, …, cK, each with ni members
• Classes: ground-truth partitions
• Clusters: machine-generated partitions (one common criterion is sketched below)
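The card does not name a specific measure; one common external criterion is purity, sketched here as an assumption. Each cluster is credited with its most frequent gold-standard class, and purity is the fraction of documents so credited:

from collections import Counter

def purity(clusters, gold_labels):
    # clusters: list of lists of doc ids; gold_labels: doc id -> class.
    total = sum(len(c) for c in clusters)
    correct = sum(Counter(gold_labels[d] for d in c).most_common(1)[0][1]
                  for c in clusters)
    return correct / total

gold = {"d1": "A", "d2": "A", "d3": "B", "d4": "B", "d5": "A"}
print(purity([["d1", "d2", "d5"], ["d3", "d4"]], gold))  # 1.0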
What is K-means in clustering? And what type of clustering is it?
K-means is a flat (partitioning) clustering algorithm (a runnable sketch
follows the pseudocode):
Given: an integer K > 0 and a set of documents
Select K random docs {s1, s2, …, sK} as seeds (i.e., the initial centroids)
Until the clustering converges (or another stopping criterion is met):
  For each doc di:
    Assign di to the cluster cj such that dist(di, sj) is minimal.
  Then update the seeds to the centroid of each cluster:
  For each cluster cj:
    sj = µ(cj)
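A minimal runnable K-means in NumPy following the pseudocode above; this is only a sketch (it assumes Euclidean distance and does not guard against empty clusters):

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Select K random docs as the initial centroids (seeds).
    seeds = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Assign each doc to the cluster whose centroid is nearest.
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update each seed to the centroid (mean) of its cluster.
        new_seeds = np.array([X[assign == j].mean(axis=0) for j in range(K)])
        if np.allclose(new_seeds, seeds):  # centroids unchanged: converged
            break
        seeds = new_seeds
    return assign, seeds

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels, centroids = kmeans(X, K=2)
print(labels)  # e.g. [0 0 1 1]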
What are the conditions for terminating K-means clustering? Give all three.
Possible termination conditions:
• A fixed number of iterations.
• Document partition unchanged.
• Centroid positions don’t change.
• K-means is a special case of a general procedure known as the
Expectation Maximization (EM) algorithm.
• EM is known to converge (to local optimum).
• The number of iterations could be large, but in practice it usually isn’t
Time Complexity of K-Means?
• Computing distance between two docs is O(M) where M is the
dimensionality of the vectors.
• Each iteration has two steps
• Reassigning clusters: O(KN) distance computations, or O(KNM).
• Computing centroids: Each doc gets added once to some centroid: O(NM).
• Hence, the overall computational complexity of each iteration is O(KNM)
• For I iterations: O(IKNM).
Pros and Cons of K-Means?
• Strengths:
• Simple: easy to understand and to implement
• Efficient: time complexity is O(IKNM)
• Since both K and I are usually small, K-means is considered a linear
algorithm
• K-means is the most popular clustering algorithm
• Weaknesses:
• The user needs to specify K
• The algorithm is sensitive to outliers
What is K-Medians? and give reasons for its use and disadvantages.
• Instead of recomputing the group centre points using the mean, we
use the median vector of the group.
• Less sensitive to outliers (because the median is used); see the comparison
sketch below
• Much slower for large datasets, since sorting is required in every iteration
to compute the median vector
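A small comparison showing why the median update resists outliers where the mean does not (toy data; NumPy only):

import numpy as np

cluster_points = np.array([[1.0, 2.0],
                           [2.0, 3.0],
                           [100.0, 2.5]])  # one outlier

mean_centre = cluster_points.mean(axis=0)          # dragged by the outlier
median_centre = np.median(cluster_points, axis=0)  # per-dimension median
print(mean_centre)    # [34.33...  2.5]
print(median_centre)  # [2.   2.5]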
What is Mean-Shift Clustering?
• A sliding-window-based algorithm that attempts to find dense areas
of data points.
• A centroid-based algorithm: the goal is to locate the center points of
each cluster, which works by updating candidates for center points to
be the mean of the points within the sliding-window.
• These candidate windows are then filtered in a post-processing stage
to eliminate near-duplicates, forming the final set of centre points
and their corresponding groups.
• Does not need the user to pre-define K
• Begin with a circular sliding window centered at a randomly selected point C,
with a radius r (given by the designer) as the kernel
• At every iteration, the sliding window is shifted towards regions of higher
density by moving the center point to the mean of the points within the window
(hence the name)
• Continue shifting the sliding window according to the mean until no shift can
accommodate more points inside the kernel
• When multiple sliding windows overlap, the window containing the most points
is preserved (a single-window sketch follows)
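A sketch of a single mean-shift window in NumPy, under the card’s setup (random start C, designer-chosen radius r; the tolerance and iteration cap are my additions). A full implementation would run many windows and merge near-duplicate centres as described above; scikit-learn’s MeanShift class does this in practice.

import numpy as np

def shift_window(X, start, r, tol=1e-6, max_iters=100):
    centre = start.astype(float)
    for _ in range(max_iters):
        # Points currently inside the circular window of radius r.
        in_window = X[np.linalg.norm(X - centre, axis=1) <= r]
        # Shift the centre to the mean of those points (higher density).
        new_centre = in_window.mean(axis=0)
        if np.linalg.norm(new_centre - centre) < tol:
            break  # no shift captures more points: a density peak
        centre = new_centre
    return centre

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0]])
print(shift_window(X, start=X[0], r=1.0))  # converges near the dense group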
Give a pro and a con of mean-shift clustering?
• Pro: there is no need to select the number of clusters, as mean-shift
discovers this automatically.
• Con: the selection of the window size/radius r can be non-trivial.