Clustering Flashcards

1
Q

What is a Centroid

A

The mean vector of all items assigned to a given cluster (i.e. the mean of their feature vectors)
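
A minimal NumPy sketch (the values are illustrative):

  import numpy as np

  # feature vectors of the items assigned to one cluster (3 items, 2 features)
  cluster_items = np.array([[1.0, 2.0],
                            [3.0, 4.0],
                            [5.0, 6.0]])

  # the centroid is the element-wise mean of the feature vectors
  centroid = cluster_items.mean(axis=0)  # -> array([3., 4.])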

2
Q

What are partitioning methods (Clustering)

A

Clustering methods that partition a dataset into a predefined number of clusters

3
Q

Steps of Lloyd’s Algorithm for k-means clustering

A
  1. Randomly select k initial cluster centroids
  2. Assign every item to its nearest centroid
  3. Recompute each centroid as the mean of the items assigned to it
  4. Repeat from step 2 until no reassignments occur
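
A minimal NumPy sketch of these steps (function and variable names are illustrative; the empty-cluster edge case is ignored for brevity):

  import numpy as np

  def lloyd_kmeans(X, k, rng=np.random.default_rng(0)):
      """Lloyd's algorithm: X is an (n, d) array of n items with d features."""
      # 1. randomly select k initial cluster centroids
      centroids = X[rng.choice(len(X), size=k, replace=False)]
      assignments = None
      while True:
          # 2. assign every item to its nearest centroid
          dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
          new_assignments = dists.argmin(axis=1)
          # 4. stop once no reassignments occur
          if assignments is not None and np.array_equal(assignments, new_assignments):
              return centroids, assignments
          assignments = new_assignments
          # 3. recompute each centroid as the mean of its assigned items
          centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
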
4
Q

What are the limitations of k-Means, in particular Lloyd’s algorithm

A
  • must specify the number of clusters k in advance
  • Lloyd’s algorithm is highly sensitive to the choice of initial centroids
  • assumes spherically shaped clusters
  • unable to handle outliers and noise
5
Q

Advantages of k-Means

A
  • fast; the heuristic is easy to implement
  • “good enough” in a wide variety of tasks and domains
6
Q

What is a common strategy for tackling cluster initialisation in k-Means

A

Run the algorithm multiple times with different initial centroids, then select the solution(s) that score well according to some validation measure
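
As one concrete example, scikit-learn's KMeans implements this strategy via its n_init parameter, using inertia (the within-cluster sum of squares) as the validation measure; a sketch on toy data:

  import numpy as np
  from sklearn.cluster import KMeans

  X = np.random.default_rng(0).normal(size=(100, 2))  # toy data

  # n_init=10 runs Lloyd's algorithm ten times from different random
  # initialisations and keeps the run with the lowest inertia
  km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
  labels, best_inertia = km.labels_, km.inertia_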

7
Q

Explain how the centroids are initialised in k-means++

A
  1. Select the first centroid uniformly at random
  2. Repeat k - 1 times:
  • pick a point with probability proportional to its squared distance to the nearest existing centroid
  • add that point to the set of centroids
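
A minimal NumPy sketch of this initialisation (names are illustrative):

  import numpy as np

  def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
      """k-means++ initialisation of the cluster centroids."""
      # 1. select the first centroid uniformly at random
      centroids = [X[rng.integers(len(X))]]
      # 2. repeat k - 1 times
      for _ in range(k - 1):
          # squared distance from each point to its nearest existing centroid
          d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
          # pick a point with probability proportional to that squared distance
          centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
      return np.array(centroids)
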
8
Q

What is the intuition for k-means++ initialisation

A

The cluster centroids are initialised by favouring points that are farther away from existing centroids, resulting in the centroids being spread out across the dataset.

9
Q

What is cluster validation

A

Measures for automatically producing a quantitative evaluation of the quality of a clustering

10
Q

What is the Silhouette Measure

A

A validation measure that quantifies the degree to which each item belongs in its assigned cluster, relative to the other clusters

11
Q

How to calculate the silhouette width

A

a = average distance to all other items in the same cluster
b = average distance to all items in the nearest competing cluster
silhouette width = (b - a) / max(a, b)
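
A minimal NumPy sketch for a single item, assuming Euclidean distance and that the item's cluster contains more than one item:

  import numpy as np

  def silhouette_width(X, labels, i):
      """Silhouette width of item i under the clustering given by labels."""
      dists = np.linalg.norm(X - X[i], axis=1)
      same = labels == labels[i]
      # a: average distance to all other items in the same cluster
      # (dists[i] is 0, so dividing by count - 1 excludes the item itself)
      a = dists[same].sum() / (same.sum() - 1)
      # b: smallest average distance to the items of any competing cluster
      b = min(dists[labels == c].mean() for c in set(labels) if c != labels[i])
      return (b - a) / max(a, b)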

12
Q

How to compute the average silhouette width (ASW)

A

Calculate an overall score for a clustering by averaging the silhouette widths of all items
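
For instance, scikit-learn's silhouette_score returns this average directly; a sketch on toy data:

  import numpy as np
  from sklearn.metrics import silhouette_score

  X = np.random.default_rng(0).normal(size=(100, 2))  # toy data
  labels = (X[:, 0] > 0).astype(int)                  # toy clustering

  # the mean silhouette width over all items, i.e. the ASW
  asw = silhouette_score(X, labels)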

13
Q

Brief description of Agglomerative Hierarchical Clustering

A

Begin with each item assigned to its own cluster. Apply a bottom-up strategy where, at each step, the most similar pair of clusters are merged

14
Q

Brief description of Divisive Hierarchical Clustering

A

Begin with a single cluster containing all items. Apply a top-down strategy where, at each step, a chosen cluster is split into two sub-clusters

15
Q

AGNES algorithm

A

Inputs: Distance matrix, Cluster Metric
1. Assign every item to its own cluster (leaf nodes)
2. Find the closest pair of clusters and merge them
3. Compute the distances between the new cluster and each of the remaining clusters
4. Repeat from step 2 until there is one cluster (root node)
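SciPy's hierarchical clustering routines implement this bottom-up procedure; a minimal sketch on toy data:

  import numpy as np
  from scipy.cluster.hierarchy import fcluster, linkage
  from scipy.spatial.distance import pdist

  X = np.random.default_rng(0).normal(size=(10, 2))  # toy data
  D = pdist(X)  # condensed pairwise distance matrix

  # merge the closest pair of clusters repeatedly (average linkage);
  # Z records every merge up to the single root cluster
  Z = linkage(D, method='average')

  # cut the tree to obtain, e.g., 3 flat clusters
  labels = fcluster(Z, t=3, criterion='maxclust')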

16
Q

Define the three Cluster Metrics (Linkage)

A

Single Linkage: define cluster distance as the smallest pairwise distance between items from each cluster
Complete Linkage: define cluster distance as the largest pairwise distance between items from each cluster
Average Linkage: define cluster distance as the average of all pairwise distances between items from each cluster
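
A minimal NumPy sketch of the three metrics, assuming Euclidean distance (names are illustrative):

  import numpy as np

  def cluster_distance(A, B, linkage="single"):
      """Distance between clusters A and B, given as (n, d) arrays of items."""
      # all pairwise distances between items from each cluster
      pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
      if linkage == "single":    # smallest pairwise distance
          return pairwise.min()
      if linkage == "complete":  # largest pairwise distance
          return pairwise.max()
      return pairwise.mean()     # average linkage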

17
Q

Divisive Algorithms Template

A
  1. Start with all items in a single cluster
  2. Repeat until all items are in their own cluster:
  • choose an existing cluster to split using some splitting criterion
  • replace the chosen cluster with the two resulting sub-clusters
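
A minimal sketch of this template with illustrative choices plugged in: the largest cluster is chosen for splitting, 2-means produces the sub-clusters, and the loop stops at a target number of clusters rather than running down to singletons:

  import numpy as np
  from sklearn.cluster import KMeans

  def divisive_clustering(X, n_clusters):
      """Top-down clustering; returns a list of index arrays, one per cluster."""
      clusters = [np.arange(len(X))]  # 1. all items in a single cluster
      while len(clusters) < n_clusters:
          # choose the cluster to split (here: simply the largest)
          chosen = clusters.pop(max(range(len(clusters)),
                                    key=lambda i: len(clusters[i])))
          # split it into two sub-clusters with 2-means
          labels = KMeans(n_clusters=2, n_init=10,
                          random_state=0).fit_predict(X[chosen])
          clusters += [chosen[labels == 0], chosen[labels == 1]]
      return clusters
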
18
Q

Advantages of Hierarchical Clustering

A
  • Allows for multiple levels of granularity, both broad clusters and niche clusters.
  • No requirement to select the “correct” value for number of clusters in advance.
19
Q

Disadvantages of Hierarchical Clustering

A
  • Poor decisions made early in the clustering process can greatly influence the quality of the final clustering.
  • Once a merging or splitting decision has been made, there exists no facility to rectify a mistake at a later stage.
  • More computationally expensive than partitional methods, particularly for AGNES
20
Q

What constitutes high quality clusters

A

high intra-cluster similarity and low inter-cluster similarity

21
Q

In k-means, the global optimum may be found using what techniques?

A
  • deterministic annealing
  • genetic algorithms
22
Q

What is a Medoid

A

The point in a cluster whose total dissimilarity to all the other points in the cluster is smallest
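
A minimal NumPy sketch, assuming Euclidean distance as the dissimilarity:

  import numpy as np

  def medoid(cluster_items):
      """Return the item whose summed distance to all other items is smallest."""
      pairwise = np.linalg.norm(
          cluster_items[:, None, :] - cluster_items[None, :, :], axis=2)
      return cluster_items[pairwise.sum(axis=1).argmin()]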

23
Q

What is the basic idea of Partitioning Around Medoids (PAM)

A

Start from an initial set of medoids and iteratively replace a medoid with a non-medoid whenever the swap reduces the total distance of the resulting clustering
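
A greedy NumPy sketch of this idea (illustrative, not an optimised PAM implementation):

  import numpy as np

  def pam(X, k, rng=np.random.default_rng(0)):
      """Swap a medoid for a non-medoid whenever it reduces the total distance."""
      D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
      medoids = list(rng.choice(len(X), size=k, replace=False))

      def total_distance(ms):
          # sum over items of the distance to their nearest medoid
          return D[:, ms].min(axis=1).sum()

      improved = True
      while improved:
          improved = False
          for pos in range(k):
              for cand in range(len(X)):
                  if cand in medoids:
                      continue
                  trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                  if total_distance(trial) < total_distance(medoids):
                      medoids, improved = trial, True
      return medoids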