Clustering 8 Flashcards
What is clustering?
• Clustering: the process of grouping a set of objects into clusters of similar objects
• Documents within a cluster should be similar.
• Documents from different clusters should be dissimilar.
• Clustering is the most common form of unsupervised learning.
Benefits of Clustering?
• Whole corpus analysis/navigation
• Better user interface: search without typing
• For improving performance in search applications
• Cluster hypothesis - Documents in the same cluster behave similarly with
respect to relevance to information needs
• Therefore, to improve searching performance:
• Cluster the docs in the corpus a priori
• When a query matches a doc D, also return the other docs in the cluster
containing D (see the sketch below)
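A minimal sketch of this cluster-expansion step, assuming a precomputed doc-to-cluster mapping; all names and data here are hypothetical:

def expand_with_cluster(matched_doc, doc_to_cluster, cluster_to_docs):
    """Return the matched doc plus the other docs in its cluster."""
    cluster_id = doc_to_cluster[matched_doc]
    return [matched_doc] + [d for d in cluster_to_docs[cluster_id]
                            if d != matched_doc]

# Query matches "d2"; its cluster also contains "d5" and "d9".
doc_to_cluster = {"d1": 0, "d2": 1, "d5": 1, "d9": 1}
cluster_to_docs = {0: ["d1"], 1: ["d2", "d5", "d9"]}
print(expand_with_cluster("d2", doc_to_cluster, cluster_to_docs))
# ['d2', 'd5', 'd9']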
What is hard clustering?
Hard clustering: Each document belongs to exactly one cluster
• More common and easier to do
What is hierarchical clustering?
Hierarchical clustering builds a hierarchy of clusters, represented as a
dendrogram. A flat clustering is obtained by cutting the dendrogram at a
desired level: each connected component then forms a cluster.
Types of Hierarchical Clustering?
• Divisive (top down) clustering: starts with all data points in one
cluster, the root, then
• Splits the root into a set of child clusters. Each child cluster is recursively
divided further
• Stops when only singleton clusters remain, i.e., each cluster contains a
single data point
• Agglomerative (bottom up) clustering: the dendrogram is built from
the bottom level by
• merging the most similar (or nearest) pair of clusters
• stopping when all the data points have been merged into a single cluster
(i.e., the root cluster); a SciPy sketch follows
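A short sketch of agglomerative clustering with SciPy on toy 2-D points (assumed data): linkage builds the dendrogram bottom-up by merging the nearest pair of clusters, and fcluster cuts it at a distance threshold to yield a flat clustering, as described above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])

# Build the dendrogram bottom-up, merging the nearest pair at each step.
Z = linkage(points, method="single")

# Cut the dendrogram at distance 1.0: each connected component below
# the cut becomes one cluster.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)  # e.g. [1 1 2 2]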
How do you split a divisive cluster?
• At each step, any flat/partitioning clustering algorithm can be used, e.g.,
K-means with K = 2
Variants of defining closest pair of clusters?
- Single-linkage
- Similarity of the closest pair: the most cosine-similar pair of points
- Complete-linkage
- Similarity of the “furthest” pair: the least cosine-similar pair of points
- Centroid-linkage
- Similarity of the clusters’ centroids (centers of gravity)
- Average-linkage
- Average cosine similarity between all pairs of elements
(All four variants are sketched below.)
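A sketch of the four variants as cosine similarities between two clusters A and B (rows are document vectors; the data is illustrative), using plain NumPy:

import numpy as np

def cosine_matrix(A, B):
    # Pairwise cosine similarities between the rows of A and B.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def linkage_similarities(A, B):
    S = cosine_matrix(A, B)
    centroid_sim = cosine_matrix(A.mean(axis=0, keepdims=True),
                                 B.mean(axis=0, keepdims=True))[0, 0]
    return {
        "single":   S.max(),       # most cosine-similar pair
        "complete": S.min(),       # least cosine-similar ("furthest") pair
        "centroid": centroid_sim,  # similarity of the centroids
        "average":  S.mean(),      # average cosine over all pairs
    }

A = np.array([[1.0, 0.0], [0.9, 0.1]])
B = np.array([[0.0, 1.0], [0.2, 0.8]])
print(linkage_similarities(A, B))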
How do you know a method of clustering is good?
A good clustering method produces high-quality clusters in which:
• The intra-cluster similarity is high
• The inter-cluster similarity is low
External criteria for clustering quality
• Assesses a clustering with respect to labeled data
• Assume the documents come with C gold-standard classes, while our clustering
algorithm produces K clusters, c1, c2, …, cK, each with ni members
• Classes: ground-truth partitions
• Clusters: machine-generated partitions (one common criterion is sketched below)
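The card does not name a specific measure; one common external criterion is purity, sketched here as an assumption. Each cluster is credited with its most frequent gold-standard class, and purity is the fraction of documents so credited:

from collections import Counter

def purity(clusters, gold_labels):
    # clusters: list of lists of doc ids; gold_labels: doc id -> class.
    total = sum(len(c) for c in clusters)
    correct = sum(Counter(gold_labels[d] for d in c).most_common(1)[0][1]
                  for c in clusters)
    return correct / total

gold = {"d1": "A", "d2": "A", "d3": "B", "d4": "B", "d5": "A"}
print(purity([["d1", "d2", "d5"], ["d3", "d4"]], gold))  # 1.0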
What is K-means in clustering? And what type of clustering is it?
K-means is a flat (partitioning) clustering algorithm (a runnable sketch
follows the pseudocode):
Given: an integer K > 0 and a set of documents
Select K random docs {s1, s2, …, sK} as seeds (i.e., the initial centroids)
Until the clustering converges (or another stopping criterion is met):
  For each doc di:
    Assign di to the cluster cj such that dist(di, sj) is minimal.
  Then update the seeds to the centroid of each cluster:
  For each cluster cj:
    sj = µ(cj)
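A minimal runnable K-means in NumPy following the pseudocode above; this is only a sketch (it assumes Euclidean distance and does not guard against empty clusters):

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Select K random docs as the initial centroids (seeds).
    seeds = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Assign each doc to the cluster whose centroid is nearest.
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update each seed to the centroid (mean) of its cluster.
        new_seeds = np.array([X[assign == j].mean(axis=0) for j in range(K)])
        if np.allclose(new_seeds, seeds):  # centroids unchanged: converged
            break
        seeds = new_seeds
    return assign, seeds

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels, centroids = kmeans(X, K=2)
print(labels)  # e.g. [0 0 1 1]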
What are the conditions for terminating K-means clustering? Give all three.
Possible termination conditions:
• A fixed number of iterations.
• Document partition unchanged.
• Centroid positions don’t change.
• K-means is a special case of a general procedure known as the
Expectation Maximization (EM) algorithm.
• EM is known to converge (to local optimum).
• The number of iterations could be large, but in practice it usually isn’t
Time Complexity of K-Means?
• Computing distance between two docs is O(M) where M is the
dimensionality of the vectors.
• Each iteration has two steps
• Reassigning clusters: O(KN) distance computations, or O(KNM).
• Computing centroids: Each doc gets added once to some centroid: O(NM).
• Hence, the overall computational complexity of each iteration is O(KNM)
• For I iterations: O(IKNM).
Pros and Cons of K-Means?
• Strengths:
• Simple: easy to understand and to implement
• Efficient: time complexity is O(IKNM)
• Since both K and I are usually small, K-means is considered a linear
algorithm
• K-means is the most popular clustering algorithm
• Weaknesses:
• The user needs to specify K
• The algorithm is sensitive to outliers
What is K-Medians? and give reasons for its use and disadvantages.
• Instead of recomputing the group centre points using the mean, we
use the median vector of the group.
• Less sensitive to outliers (because the median is used); see the comparison
sketch below
• Much slower for large datasets, since sorting is required in every iteration
to compute the median vector
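A small comparison showing why the median update resists outliers where the mean does not (toy data; NumPy only):

import numpy as np

cluster_points = np.array([[1.0, 2.0],
                           [2.0, 3.0],
                           [100.0, 2.5]])  # one outlier

mean_centre = cluster_points.mean(axis=0)          # dragged by the outlier
median_centre = np.median(cluster_points, axis=0)  # per-dimension median
print(mean_centre)    # [34.33...  2.5]
print(median_centre)  # [2.   2.5]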
What is Mean-Shift Clustering?
• A sliding-window-based algorithm that attempts to find dense areas
of data points.
• A centroid-based algorithm: the goal is to locate the center points of
each cluster, which works by updating candidates for center points to
be the mean of the points within the sliding-window.
• These candidate windows are then filtered in a post-processing stage
to eliminate near-duplicates, forming the final set of centre points
and their corresponding groups.
• Does not need the user to pre-define K
• Begin with a circular sliding window centered at a randomly selected point C,
with a radius r (given by the designer) as the kernel
• At every iteration, the sliding window is shifted towards regions of higher
density by moving the center point to the mean of the points within the window
(hence the name)
• Continue shifting the sliding window according to the mean until no shift can
accommodate more points inside the kernel
• When multiple sliding windows overlap, the window containing the most points
is preserved (a single-window sketch follows)
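A sketch of a single mean-shift window in NumPy, under the card’s setup (random start C, designer-chosen radius r; the tolerance and iteration cap are my additions). A full implementation would run many windows and merge near-duplicate centres as described above; scikit-learn’s MeanShift class does this in practice.

import numpy as np

def shift_window(X, start, r, tol=1e-6, max_iters=100):
    centre = start.astype(float)
    for _ in range(max_iters):
        # Points currently inside the circular window of radius r.
        in_window = X[np.linalg.norm(X - centre, axis=1) <= r]
        # Shift the centre to the mean of those points (higher density).
        new_centre = in_window.mean(axis=0)
        if np.linalg.norm(new_centre - centre) < tol:
            break  # no shift captures more points: a density peak
        centre = new_centre
    return centre

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0]])
print(shift_window(X, start=X[0], r=1.0))  # converges near the dense group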
Give a pro and a con of mean-shift clustering?
• Pro: there is no need to select the number of clusters, as mean-shift
discovers this automatically.
• Con: the selection of the window size/radius r can be non-trivial.