Clustering Flashcards
What is a Centroid
The mean vector of all items assigned to a given cluster (i.e. the mean of their feature vectors)
What are partitioning methods (Clustering)
Clustering methods that partition a dataset into a predefined number of clusters
Steps of Lloyd’s Algorithm for k-means clustering
1. Randomly select k initial cluster centroids
2. Assign every item to its nearest centroid
3. Recompute each centroid as the mean of the items assigned to it
4. Repeat from step 2 until no reassignments occur
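A minimal NumPy sketch of these steps (the function name `lloyd_kmeans`, the seeding, and the lack of empty-cluster handling are illustrative assumptions, not part of the original algorithm statement):

```python
import numpy as np

def lloyd_kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly select k initial centroids from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assignments = np.full(len(X), -1)
    while True:
        # 2. Assign every item to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new = dists.argmin(axis=1)
        # 4. Stop once no reassignments occur
        if np.array_equal(new, assignments):
            return centroids, assignments
        assignments = new
        # 3. Recompute each centroid as the mean of its assigned items
        # (empty clusters are not handled in this sketch)
        centroids = np.stack([X[assignments == j].mean(axis=0)
                              for j in range(k)])
```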
What are the limitations of k-Means, in particular Lloyd’s algorithm
- must specify the number of clusters k in advance
- Lloyd’s algorithm is highly sensitive to the choice of initial centroids
- assumes spherically shaped clusters
- unable to handle outliers and noise
Advantages of k-Means
- fast; the heuristic is easy to implement
- “good enough” in a wide variety of tasks and domains
What is a common strategy for tackling cluster initialisation in k-Means
run the algorithm multiple times and select the solution that scores best according to some validation measure
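A hedged sketch of this strategy using scikit-learn, with average silhouette width as the validation measure (the synthetic data and the choice of ten restarts are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Run k-means several times with different random initialisations and
# keep the clustering with the highest average silhouette width
best_score, best_labels = -1.0, None
for seed in range(10):
    labels = KMeans(n_clusters=3, init="random", n_init=1,
                    random_state=seed).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_score, best_labels = score, labels
```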
Explain how the centroids are initialised in k-means++
- select the first centroid uniformly at random
- Repeat k - 1 times:
  - pick a point with probability proportional to the squared distance between it and its nearest existing centroid
  - add the point to the set of centroids
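A minimal NumPy sketch of this initialisation (the function name `kmeans_pp_init` and the RNG handling are illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: chosen uniformly at random from the data
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest existing centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```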
What is the intuition for k-means++ initialisation
The cluster centroids are initialised by favouring points that are farther away from existing centroids, resulting in the centroids being spread out across the dataset.
What is cluster validation
Measures for automatically producing a quantitative evaluation of the quality of a clustering
What is the Silhouette Measure
Validation measure which quantifies the degree to which each item belongs in its assigned cluster, relative to the other clusters
How to calculate the silhouette width
a = average distance to all other items in the same cluster
b = average distance to all items in the nearest competing cluster
silhouette width = (b-a) / max(a, b)
How to compute the average silhouette width (ASW)
Calculate an overall score for a clustering by averaging the silhouette widths of all items
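A direct sketch of both definitions, assuming `X` and `labels` are NumPy arrays with at least two clusters (singleton-cluster edge cases are not handled):

```python
import numpy as np

def silhouette_widths(X, labels):
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    widths = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        # a: average distance to the other items in the same cluster
        a = dists[i, same].sum() / max(same.sum() - 1, 1)
        # b: average distance to items of the nearest competing cluster
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        widths[i] = (b - a) / max(a, b)
    return widths

# ASW: average the silhouette widths over all items, e.g.
# asw = silhouette_widths(X, labels).mean()
```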
Brief description of Agglomerative Hierarchical Clustering
Begin with each item assigned to its own cluster. Apply a bottom-up strategy where, at each step, the most similar pair of clusters are merged
Brief description of Divisive Hierarchical Clustering
Begin with a single cluster containing all items. Apply a top-down strategy where, at each step, a chosen cluster is split into two sub-clusters
AGNES algorithm
Inputs: Distance matrix, Cluster Metric
1. Assign every item to its own cluster (leaf nodes)
2. Find the closest pair of clusters and merge them
3. Compute the distances between the new cluster and each of the remaining clusters
4. Repeat from step 2 until there is one cluster (root node)
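These steps are implemented by SciPy's hierarchical clustering routines; a small sketch (the random data and the cut into three flat clusters are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(20, 2))  # illustrative data

# pdist builds the condensed distance matrix (the AGNES input);
# `method` is the cluster metric: "single", "complete", or "average"
Z = linkage(pdist(X), method="average")

# Cut the tree below the root to recover, e.g., three flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
```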
Define the three Cluster Metrics (Linkage)
Single Linkage: define cluster distance as the smallest pairwise distance between items from each cluster
Complete Linkage: define cluster distance as the largest pairwise distance between items from each cluster
Average Linkage: define cluster distance as the average of all pairwise distances between items from each cluster
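A small sketch of the three metrics as operations on the pairwise distances between two clusters' feature matrices (the helper name `cluster_distance` is illustrative):

```python
from scipy.spatial.distance import cdist

def cluster_distance(A, B, linkage="single"):
    # All pairwise distances between items of cluster A and cluster B
    d = cdist(A, B)
    if linkage == "single":    # smallest pairwise distance
        return d.min()
    if linkage == "complete":  # largest pairwise distance
        return d.max()
    return d.mean()            # average: mean of all pairwise distances
```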
Divisive Algorithms Template
- Start with all items in a single cluster
- Repeat until every item is in its own cluster:
  - choose an existing cluster to split using some splitting criterion
  - replace the chosen cluster with the two resulting sub-clusters
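One common instantiation of this template is bisecting k-means, where the splitting criterion picks the largest cluster and 2-means performs the split. A hedged sketch (the stopping condition here is a target cluster count rather than singleton leaves, and the largest-cluster criterion is just one choice):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_clusters(X, n_clusters):
    # Start with all items in a single cluster (stored as index arrays)
    clusters = [np.arange(len(X))]
    while len(clusters) < n_clusters:
        # Illustrative splitting criterion: pick the largest cluster
        chosen = clusters.pop(max(range(len(clusters)),
                                  key=lambda i: len(clusters[i])))
        # Split the chosen cluster into two sub-clusters with 2-means
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=0).fit_predict(X[chosen])
        clusters += [chosen[labels == 0], chosen[labels == 1]]
    return clusters
```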
Advantages of Hierarchical Clustering
- Allows for multiple levels of granularity, both broad clusters and niche clusters.
- No requirement to select the “correct” value for number of clusters in advance.
Disadvantages of Hierarchical Clustering
- Poor decisions made early in the clustering process can greatly influence the quality of the final clustering.
- Once a merging or splitting decision has been made, there exists no facility to rectify a mistake at a later stage.
- More computationally expensive than partitional methods, particularly for AGNES
What constitutes high quality clusters
high intra-cluster similarity and low inter-cluster similarity
In k-means, the global optimum may be found using what techniques?
- deterministic annealing
- genetic algorithms
What is a Medoid
A point within the cluster whose total dissimilarity to all other points in the cluster is minimal
What is the basic idea of Partitioning Around Medoids (PAM)
Start from an initial set of medoids and iteratively replace a medoid with a non-medoid whenever the swap reduces the total distance of the resulting clustering
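A minimal sketch of this swap loop, assuming a precomputed n x n distance matrix (the function name `pam` and the greedy sweep order are illustrative):

```python
import numpy as np

def pam(dists, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(dists)
    medoids = list(rng.choice(n, size=k, replace=False))  # initial medoids

    def cost(meds):
        # Total distance of each item to its nearest medoid
        return dists[:, meds].min(axis=1).sum()

    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                # Swap medoid i with non-medoid h if it lowers the cost
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                if cost(candidate) < cost(medoids):
                    medoids, improved = candidate, True
    # Cluster labels follow from nearest-medoid assignment:
    # labels = dists[:, medoids].argmin(axis=1)
    return medoids
```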