Clustering Flashcards
On a basic level, how is clustering different from classification?
Clustering does not have pre-defined class labels.
Each cluster is a set of examples with similar attribute values. There is no special class attribute.
What are the two clustering algorithms that were covered in the lecture?
K-means and clustering based on minimum spanning trees.
What are agglomerative algorithms?
Hierarchical algorithms that build a tree of clusters bottom-up: each example starts in its own cluster, and the two closest clusters are repeatedly merged.
How does the K-means algorithm work?
Choose, at random, K points to be the initial centroids of the K clusters.
Repeat: assign each example to the cluster with the nearest centroid, then recompute each centroid as the mean of the examples assigned to it.
Until no example changes cluster.
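A minimal sketch of these steps in Python (NumPy only; the function name and interface here are illustrative, not from the lecture):

```python
import numpy as np

def k_means(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Choose K examples at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    while True:
        # Assign each example to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Until no example changes cluster.
        if np.array_equal(new_labels, labels):
            return labels, centroids
        labels = new_labels
        # Recompute each centroid as the mean of the examples assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
```

For example, `k_means(np.random.default_rng(1).random((100, 2)), k=3)` returns a cluster label for each of the 100 examples plus the 3 final centroids.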
What is the Repeat-Until loop?
It is the iterative core of the K-means algorithm: the assignment and centroid-update steps are repeated until no example changes cluster.
What are the limitations of K-means?
- Tends to only discover clusters of hyper-spherical shape
- Sensitive to outliers
- Requires predefined number of clusters (often not natural)
- Different initial centroids can lead to different clustering results (so it is recommended to run K-means MANY times, with a different set of initial centroids on each run, and check whether all runs lead to similar clustering results)
- Uses a simple iterative approach that may find a “local” rather than “global” minimum of its objective (the total distance between each cluster centroid and its associated examples, summed over all clusters)
What is a minimum spanning tree?
The spanning tree with the minimum total edge weight, out of all spanning trees of the graph.
What are the steps in Partitional Clustering based on minimum spanning trees?
1) Construct the Minimum Spanning Tree (MST) for the data
2) Identify “inconsistent” edges in the MST
3) Remove inconsistent edges and consider each of the connected components as a cluster
When is an edge inconsistent in Partitional Clustering based on MST?
If its weight (the distance between its two end nodes) is significantly larger than the average weight of the nearby edges.
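A sketch of the three steps above in Python (using SciPy; the factor-times-average test below is a simplification of the “nearby edges” criterion, comparing against the average weight of all MST edges instead, and the factor of 2 is an arbitrary illustration):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_clusters(X, factor=2.0):
    # 1) Construct the MST of the complete pairwise-distance graph.
    dist = squareform(pdist(X))
    mst = minimum_spanning_tree(dist).toarray()
    # 2) Flag "inconsistent" edges: here, edges whose weight exceeds
    #    factor * the average MST edge weight (a global simplification
    #    of the lecture's nearby-edges rule).
    threshold = factor * mst[mst > 0].mean()
    # 3) Remove them; each connected component of what remains is a cluster.
    mst[mst > threshold] = 0
    _, labels = connected_components(mst, directed=False)
    return labels
```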
What does K-means attempt to do?
It tries to minimise the total distance between each cluster’s centroid and its associated examples, summed over all clusters.
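As a formula (a standard way to write it, assuming squared Euclidean distance; the lecture may state it with plain distances):

$$ J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2 $$

where C_k is the set of examples in cluster k and \mu_k is its centroid.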
What does Graph-Based Clustering try to do?
First, it constructs a minimum spanning tree connecting all the examples (the nodes of the graph), then it removes some edges from that MST to create separate clusters.
What are the 3 Agglomerative Clustering methods of computing the distance between two clusters?
Single-link agglomerative clustering
Complete-link agglomerative clustering
Average-link agglomerative clustering
What is single-link agglomerative clustering?
The distance between two clusters is the distance between the NEAREST pair of examples where each example belongs to a different cluster.
What is complete-link agglomerative clustering?
The distance between two clusters is the distance between the MOST distant pair of examples where each object belongs to a different cluster.
What is average-link agglomerative clustering?
The distance between two clusters is the AVERAGE distance over all pairs of examples where each example belongs to a different cluster.
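A small NumPy sketch of the three criteria (the function name and interface are mine, for illustration):

```python
import numpy as np

def cluster_distance(A, B, method="single"):
    # All pairwise distances, one example from each cluster
    # (rows of A and B are examples).
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    if method == "single":      # nearest pair
        return d.min()
    if method == "complete":    # most distant pair
        return d.max()
    if method == "average":     # average over all pairs
        return d.mean()
    raise ValueError(f"unknown method: {method}")
```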
What are the limitations of hierarchical clustering?
The decision to merge or split two clusters is greedy and cannot be undone later during construction of the dendrogram.
Computationally expensive.
After creating the dendrogram we still have to decide where to cut it, in order to produce a set of clusters (see the sketch below).
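For example, with SciPy’s hierarchical clustering, cutting the dendrogram is a separate decision made after it is built (the cut height of 1.5 here is an arbitrary illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.default_rng(0).random((20, 2))
Z = linkage(X, method="average")  # greedily build the dendrogram
# The dendrogram alone is not a clustering: we still choose where to cut.
labels = fcluster(Z, t=1.5, criterion="distance")
```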
What are the challenges in evaluating the quality of clustering solutions?
There is no ground truth, so there is no objective measure of accuracy.
We can minimise the total distance between examples within each cluster, but it can also be minimised trivially by assigning each example to its own cluster - useless: too many clusters and no generalisation.
Trade-off: we want to minimise both the total within-cluster distance and the number of clusters (see the sketch after this list).
Most partitional algorithms discover clusters of a given shape, imposing that structure on the data.
Humans cannot visualise solutions in more than 3 dimensions, which is where the real value of clustering often lies.
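A quick way to see the trade-off above (a sketch using scikit-learn; `inertia_` is scikit-learn’s name for the total within-cluster squared distance):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((200, 2))
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ always shrinks as k grows, reaching 0 when every
    # example gets its own cluster - hence the trade-off.
    print(k, km.inertia_)
```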
What methods can you use to evaluate Clustering Solutions?
Re-run the same algorithm with different random parameter settings (e.g. different initial centroids) and check whether the results are stable.
Apply the algorithm to random data to check what structure it imposes.
Apply the algorithm to slightly different versions of the data, e.g. with one or a few attributes removed, and compare the results with those obtained when all attributes are used.