Clustering 8 Flashcards

1
Q

What is clustering?

A

• Clustering: the process of grouping a set of objects into clusters of similar objects
• Documents within a cluster should be similar.
• Documents from different clusters should be dissimilar
• The most common form of unsupervised learning

2
Q

Benefits of Clustering?

A

• Whole corpus analysis/navigation
• Better user interface: search without typing
• For improving performance in search applications
• Cluster hypothesis - Documents in the same cluster behave similarly with
respect to relevance to information needs
• Therefore, to improve search performance:
• Cluster docs in the corpus a priori
• When a query matches a doc D, also return other docs in the cluster
containing D

3
Q

What is hard clustering?

A

Hard clustering: Each document belongs to exactly one cluster
• More common and easier to do

4
Q

What is hierarchical clustering?

A

Hierarchical clustering builds a hierarchy of nested clusters (a dendrogram) rather than a single flat partition. A flat clustering can then be obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

5
Q

Types of Hierarchical Clustering?

A

• Divisive (top down) clustering: starts with all data points in one
cluster, the root, then
• Splits the root into a set of child clusters. Each child cluster is recursively
divided further
• Stops when only singleton clusters of individual data points remain, i.e., each
cluster with only a single point
• Agglomerative (bottom up) clustering: the dendrogram is built from
the bottom level by
• merging the most similar (or nearest) pair of clusters
• stopping when all the data points are merged into a single cluster (i.e., the
root cluster).

6
Q

How do you split a divisive cluster?

A

• At each step, any flat/partitioning clustering algorithm (e.g., K-means) can be used to split a cluster

7
Q

Variants of defining closest pair of clusters?

A
  • Single-linkage
  • Similarity of the most cosine-similar pair of points (single link)
  • Complete-linkage
  • Similarity of the “furthest” pair of points, i.e., the least cosine-similar
  • Centroid-linkage
  • Clusters whose centroids (centers of gravity) are the most cosine-similar
  • Average-linkage
  • Average cosine similarity between all pairs of elements
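
A minimal sketch of these linkage variants using SciPy (an assumption: SciPy is installed, and the toy data below is illustrative). Note that scipy.cluster.hierarchy works with distances (Euclidean by default) rather than cosine similarity, but the linkage criteria correspond to the variants above; fcluster shows the dendrogram being cut into a flat clustering.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    docs = np.random.rand(20, 5)                   # 20 illustrative "documents", 5 features each

    Z_single   = linkage(docs, method="single")    # closest pair of points
    Z_complete = linkage(docs, method="complete")  # furthest pair of points
    Z_centroid = linkage(docs, method="centroid")  # distance between centroids
    Z_average  = linkage(docs, method="average")   # average pairwise distance

    # Cut the (average-linkage) dendrogram to obtain a flat clustering with 4 clusters.
    labels = fcluster(Z_average, t=4, criterion="maxclust")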
8
Q

How do you know a method of clustering is good?

A

A good clustering will produce high quality
clusters in which:
• The intra-cluster similarity is high
• The inter-cluster similarity is low
External criteria for clustering quality
• Assess a clustering with respect to labeled data
• Assume documents with C gold standard classes, while our clustering algorithm produces K clusters, c1, c2, …, cK, each with ni members.
• Classes: ground truth partitions
• Clusters: machine-generated partitions

9
Q

What is K-means in clustering? And what type of clustering is it?

A

K-means is a flat (partitioning) clustering algorithm.

Given: an integer K > 0 and a set of documents
Select K random docs {s1, s2, …, sK} as seeds (i.e., initial centroids)
Until clustering converges (or another stopping criterion is met):
    For each doc di:
        Assign di to the cluster cj such that dist(di, sj) is minimal.
    Next, update the seeds to the centroid of each cluster:
    For each cluster cj:
        sj = µ(cj)
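
A compact NumPy sketch of this loop (illustrative only: the function and variable names are assumptions, and real implementations also handle empty clusters and use better seeding):

    import numpy as np

    def kmeans(docs, K, n_iter=100, seed=0):
        """Plain K-means: docs is an (N, M) array, K the number of clusters."""
        rng = np.random.default_rng(seed)
        centroids = docs[rng.choice(len(docs), size=K, replace=False)]  # K random docs as seeds
        for _ in range(n_iter):
            # Assignment step: each doc goes to the cluster with the nearest centroid.
            dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centroid becomes the mean of the docs assigned to it.
            new_centroids = np.array([docs[labels == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centroids, centroids):   # centroids unchanged -> converged
                break
            centroids = new_centroids
        return labels, centroids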

10
Q

Conditions of terminating K-means clustering? Give all three.

A

Possible termination conditions:
• A fixed number of iterations.
• Document partition unchanged.
• Centroid positions don’t change.

• K-means is a special case of a general procedure known as the
Expectation Maximization (EM) algorithm.
• EM is known to converge (to local optimum).
• The number of iterations could be large, but in practice it usually isn’t.

11
Q

Time Complexity of K-Means?

A

• Computing distance between two docs is O(M) where M is the
dimensionality of the vectors.
• Each iteration has two steps
• Reassigning clusters: O(KN) distance computations, or O(KNM).
• Computing centroids: Each doc gets added once to some centroid: O(NM).
• Hence, the overall computational complexity of each iteration is O(KNM)
• For I iterations: O(IKNM).

12
Q

Pros and Cons of K-Means?

A

• Strengths:
• Simple: easy to understand and to implement
• Efficient: time complexity is O(IKNM)
• Since both K and I are usually small, K-means is considered a linear algorithm.
• K-means is the most popular clustering algorithm.
• Weaknesses:
• The user needs to specify K
• The algorithm is sensitive to outliers

13
Q

What is K-Medians? Give reasons for its use and its disadvantages.

A

• Instead of recomputing the group centre points using the mean, we
use the median vector of the group.
• Less sensitive to outliers (because of using the Median)
• Much slower for larger datasets as sorting is required on each
iteration when computing the Median vector.
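
A sketch of the only change relative to the K-means update step above (reusing the docs, labels and K names from that sketch; the per-dimension median replaces the mean):

    # K-medians update step: per-dimension median instead of mean.
    new_centroids = np.array([np.median(docs[labels == k], axis=0) for k in range(K)])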

14
Q

What is Mean-Shift Clustering?

A

• A sliding-window-based algorithm that attempts to find dense areas
of data points.
• A centroid-based algorithm: the goal is to locate the center points of
each cluster, which works by updating candidates for center points to
be the mean of the points within the sliding-window.
• These candidate windows are then filtered in a post-processing stage
to eliminate near-duplicates, forming the final set of centre points
and their corresponding groups.
• Does not require the user to pre-define K

• Begin with a circular sliding window centered at a point C (randomly selected) and having radius r (given by the designer) as the kernel.

• At every iteration, the sliding window is shifted towards regions of higher density by shifting the center point to the mean of the points within the window (hence the name).
• Continue shifting the sliding window according to the mean until there is no direction in which a shift can accommodate more points inside the kernel.
• When multiple sliding windows overlap, the window containing the most points is preserved.
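
A short usage sketch with scikit-learn (an assumption: scikit-learn is installed; its bandwidth parameter plays the role of the window radius r, and the data is illustrative):

    import numpy as np
    from sklearn.cluster import MeanShift

    X = np.random.rand(200, 2)              # illustrative 2-D points
    ms = MeanShift(bandwidth=0.2)           # bandwidth ~ window radius r
    labels = ms.fit_predict(X)
    centres = ms.cluster_centers_           # final set of centre points
    print(len(centres), "clusters found")   # K is discovered, not pre-specified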

15
Q

Give a pro and a con of mean-shift clustering.

A

• Pro: there is no need to select the number of clusters, as mean-shift discovers this automatically.
• Con: the selection of the window size/radius “r” can be non-trivial.

16
Q

What is Density-Based Spatial Clustering of Applications with Noise (DBSCAN)?

A

Extends mean-shift by identifying outliers.
• Good at handling non-spherical (arbitrarily shaped) data

• Begins with an arbitrary starting data point that has not been visited. The
neighborhood of this point is extracted using a distance epsilon ε (All points
which are within the ε distance are neighborhood points).
• If there are a sufficient number of points (according to minPoints, a pre-defined
hyperparameter) within this neighborhood then the clustering process starts and
the current data point becomes the first point in the new cluster. Otherwise, the
point will be labeled as noise (this noisy point might later become part of a
cluster). In both cases that point is marked as “visited”.
• For this first point in the new cluster, the points within its ε distance neighbourhood also become part of the same cluster. This procedure of making all points in the ε neighbourhood belong to the same cluster is then repeated for all of the new points that have been just added to the cluster group.

  • Repeat the previous two steps until all points in the cluster are determined i.e. all points within the ε neighbourhood of the cluster have been visited and labelled.
  • Once we’re done with the current cluster, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise.
  • This process repeats until all points are marked as visited. Since at the end of this all points have been visited, each point will have been marked as either belonging to a cluster or being noise.
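
A short scikit-learn usage sketch (an assumption: scikit-learn is installed; eps corresponds to ε, min_samples to minPoints, and the data is illustrative):

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(300, 2)                        # illustrative 2-D points
    db = DBSCAN(eps=0.05, min_samples=5)              # eps = ε, min_samples = minPoints
    labels = db.fit_predict(X)                        # label -1 marks noise / outlier points
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)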
17
Q

Pros and cons of DBSCAN?

A

Pros:
• Does not require K (the number of clusters)
• Identifies outliers as noise, unlike mean-shift, which simply throws them into a cluster even if the data point is very different.
• Can find arbitrarily sized and arbitrarily shaped clusters quite well.

Cons:
• Does not perform as well as other methods when the clusters are of varying density
• Reason: suitable settings of the distance threshold ε and minPoints for identifying the neighborhood points vary from cluster to cluster when the density varies.

18
Q

What is Soft Clustering?

A

• Soft clustering: A document can belong to more than one cluster.
• Makes more sense for applications like creating browsable
hierarchies
• You may want to put a pair of sneakers in two clusters: (i) sports
apparel and (ii) shoes
• You can only do that with a soft clustering approach.

19
Q

What do Partitioning Algorithms do?

A
  • Partitioning method: Construct a partition of n documents into a set of K clusters
  • Given: a set of documents (and, for some algorithms, the number K)
  • Find: a partition of K clusters that optimizes the chosen partitioning criterion (e.g., the average similarity within each cluster is maximized)
20
Q

How does seed choice affect results?

How do you select a good seed?

A

• Results can vary based on random seed selection.
• Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings.
• Select good seeds using a heuristic (e.g., a doc least similar to any existing mean)
• Try out multiple starting points
• Initialize with the results of another method.
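
In practice, clustering libraries bundle such heuristics; a scikit-learn sketch (an assumption: scikit-learn is installed; k-means++ is a related spread-out-seeds heuristic, and n_init tries multiple starting points and keeps the best result):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(200, 10)          # illustrative data
    km = KMeans(n_clusters=5,
                init="k-means++",        # heuristic seeding: spread-out initial centroids
                n_init=10,               # try 10 different starting points, keep the best
                random_state=0)
    labels = km.fit_predict(X)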
21
Q

Single Vs. Complete Linkage

A
  • Single link
  • Encourages growth of elongated clusters
  • Disadvantage: very sensitive to noise
  • Complete link
  • Encourages compact clusters
  • Disadvantage: does not work well when elongated clusters are present
22
Q

Divisive Vs. Agglomerative Clustering

A

• Divisive:
• When taking the first step (a split), it has access to all data points; hence it can find the best possible split at the highest level
• Less “blind” to the global structure of the data
• Agglomerative:
• When taking the first step (a merge), it does not consider the global structure of the data, but instead only looks at pairwise structure
• Agglomerative is faster to compute, in general

23
Q

What is purity?

A

• Purity for each cluster: percentage of the majority class in the cluster

• High purity is easy to achieve when the number of clusters is large - in
particular, purity is 1 if each document gets its own cluster. Thus, we
cannot use purity to trade off the quality of the clustering against the
number of clusters.
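
The standard formula for overall purity (the usual definition, not stated on the card), for N documents, machine clusters ω_k and gold classes c_j:

    \mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \lvert \omega_k \cap c_j \rvert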

24
Q

What does Rand Index do?

A

The Rand Index measures the accuracy of pairwise decisions: for every pair of documents, the clustering’s decision (same cluster or different clusters) is compared against the gold-standard classes, and the index is the fraction of these pair decisions that agree.
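
In the usual formulation (standard definition, not from the card), pairs are counted as true/false positives/negatives depending on whether the clustering and the gold classes each put the two documents together:

    \mathrm{RI} = \frac{TP + TN}{TP + FP + FN + TN}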