Clustering Flashcards

Question 1

Q

What are some applications of clustering?

Answer

A

Image segmentation
Social network analysis
Bioinformatics

Question 2

Q

What is clustering?

Answer

A

The process of grouping a set of instances into clusters, so that instances in the same cluster are similar to one another

Question 3

Q

What are the two types of clustering algorithms?

Answer

A

Partitional algorithms
Hierarchical algorithms

Question 4

Q

What are some considerations we need to make for clustering?

Answer

A

How to measure the similarity between two instances
Which clustering approach to use
How many clusters to produce

Question 5

Q

What is a centroid?

Answer

A

A point that is considered to be the centre of a cluster

Question 6

Q

What are the steps for the k-means algorithm?

Answer

A

Choose k initial centroids
Compute the distance from every data point to every centroid
Assign each data point to its closest centroid
For each centroid, compute the mean of all points assigned to it
Replace the centroids with the new means
Repeat from the distance step until the assignments stop changing
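
The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iters` and `seed` parameters, starting from k randomly sampled data points, and stopping when the assignments no longer change are all assumptions of the sketch.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Assumed initialisation: use k randomly sampled data points as centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Measure every point against every centroid and keep the closest one.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the centroids would not move
        assignments = new_assignments
        # Replace each centroid with the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignments


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other.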

Question 7

Q

Can the final result of k-means clustering be affected by the initial centroid choice?

Answer

A

Yes. Some initial choices lead to slow convergence, or to convergence on a suboptimal clustering
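
A small illustration of this sensitivity, under assumptions: the helper `lloyd`, the four-point data set, and the use of the sum of squared distances (SSE) as the quality score are all made up for this sketch. The points form a 4 x 1 rectangle, and k-means is run twice with k = 2 from different starting centroids.

```python
import math


def lloyd(points, centroids, max_iters=100):
    """Run k-means from the given starting centroids; return (centroids, sse)."""
    centroids = list(centroids)
    for _ in range(max_iters):
        # Assign each point to the index of its closest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Mean of the members, or the old centroid if the cluster is empty.
            updated.append(
                tuple(sum(x) / len(members) for x in zip(*members)) if members else old
            )
        if updated == centroids:
            break  # centroids did not move: converged
        centroids = updated
    # Sum of squared distances from each point to its nearest centroid.
    sse = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
    return centroids, sse


points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start with one centroid on each short side: converges to left/right clusters.
_, good_sse = lloyd(points, [(0.0, 0.0), (4.0, 0.0)])
# Start with both centroids on the same short side: stuck with top/bottom clusters.
_, bad_sse = lloyd(points, [(0.0, 0.0), (0.0, 1.0)])
# good_sse is 1.0 and bad_sse is 16.0
```

Both runs converge, but the second one settles on a clustering whose SSE is sixteen times worse.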

Question 8

Q

What is the time complexity of k-means clustering?

Answer

A

O(iknd)

i - number of iterations
k - number of centroids
n - number of data points
d - number of features per data point

Question 9

Q

What are the limitations of k-means clustering?

Answer

A

The parameter k must be chosen in advance
Data must be numerical and comparable via a suitable distance measure
The algorithm is sensitive to outliers that do not belong in any cluster
It works best on data with spherical clusters

Question 10

Q

How do we measure cluster validity?

Answer

A

High inter-cluster (between-cluster) distances
Low intra-cluster (within-cluster) distances
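
One possible way to put numbers on these two criteria is sketched below. The function names and the choice of averaging pairwise Euclidean distances are assumptions of this sketch; other validity measures exist.

```python
import math
from itertools import combinations


def mean_intra_distance(clusters):
    """Average distance between pairs of points that share a cluster.

    Assumes at least one cluster holds two or more points.
    """
    dists = [math.dist(a, b) for cl in clusters for a, b in combinations(cl, 2)]
    return sum(dists) / len(dists)


def mean_inter_distance(clusters):
    """Average distance between pairs of points that lie in different clusters."""
    dists = [
        math.dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return sum(dists) / len(dists)


# Two ways of splitting the same four points into two clusters.
good = [[(0, 0), (0, 1)], [(4, 0), (4, 1)]]
bad = [[(0, 0), (4, 0)], [(0, 1), (4, 1)]]

good_scores = (mean_intra_distance(good), mean_inter_distance(good))
bad_scores = (mean_intra_distance(bad), mean_inter_distance(bad))
```

The `good` split has the lower intra-cluster distance and the higher inter-cluster distance, which matches the criteria on the card.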

Question 11

Q

What are the two kinds of hierarchical clustering algorithms?

Answer

A

Agglomerative (bottom up)
Divisive (top down)

Question 12

Q

How does HAC work?

Answer

A

Start with each point as its own cluster
Repeatedly merge the two closest clusters
Eventually all points belong to a single cluster

Question 13

Q

How does HDC work?

Answer

A

Start with all points in one cluster
Repeatedly split a cluster so that its most distant points are separated
Eventually each point is its own cluster

Question 14

Q

What are the steps for HAC?

Answer

A

Compute the distance matrix (the distance between every pair of data points)
Let each data point be its own cluster
Repeat:
    Merge the two (or more) closest clusters
    Update the distance matrix
Until only a single cluster remains
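
The loop above can be sketched as follows. This is a naive illustration under assumptions: it uses single link to define "closest", merges exactly two clusters per step, scans every cluster pair to find the minimum, and the function name `hac_single_link` and the merge-history format are invented for the example.

```python
import math
from itertools import combinations


def hac_single_link(points):
    """Return the merge history as (members_a, members_b, distance) tuples.

    Members are indices into `points`.
    """
    # Let each data point be its own cluster, keyed by a cluster id.
    clusters = {i: [i] for i in range(len(points))}
    # Distance matrix, stored as a dict keyed by the pair of cluster ids.
    dist = {
        frozenset((i, j)): math.dist(points[i], points[j])
        for i, j in combinations(clusters, 2)
    }
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        # Find the two closest clusters by scanning every remaining pair.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        merges.append((sorted(clusters[a]), sorted(clusters[b]), dist[pair]))
        merged = clusters.pop(a) + clusters.pop(b)
        # Update the distance matrix: single link keeps the smaller distance.
        for other in clusters:
            d_a = dist.pop(frozenset((a, other)))
            d_b = dist.pop(frozenset((b, other)))
            dist[frozenset((next_id, other))] = min(d_a, d_b)
        del dist[pair]
        clusters[next_id] = merged
        next_id += 1
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (20.0, 0.0)]
merges = hac_single_link(points)
```

Here the first merge joins points 0 and 1, the second joins points 2 and 3, and the isolated point 4 is absorbed last.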

Question 15

Q

What are the different ways to define closest clusters?

Answer

A

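Single link: distance between the two closest points, one from each cluster
Complete link: distance between the two furthest points, one from each cluster
Centroid: distance between the clusters' centroids (centres of gravity)
Average link: average distance over all pairs of data points, one from each cluster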
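
The four definitions can be written as short functions. This is a sketch assuming Euclidean distance and clusters given as lists of coordinate tuples. The function names are illustrative, and average link is taken here as the mean over cross-cluster pairs, which is one common reading.

```python
import math


def single_link(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_link(a, b):
    """Distance between the furthest pair of points, one from each cluster."""
    return max(math.dist(p, q) for p in a for q in b)


def centroid_link(a, b):
    """Distance between the two cluster centroids (centres of gravity)."""
    centre_a = tuple(sum(x) / len(a) for x in zip(*a))
    centre_b = tuple(sum(x) / len(b) for x in zip(*b))
    return math.dist(centre_a, centre_b)


def average_link(a, b):
    """Mean distance over all cross-cluster pairs of points."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
results = {
    "single": single_link(a, b),
    "complete": complete_link(a, b),
    "centroid": centroid_link(a, b),
    "average": average_link(a, b),
}
```

For any pair of clusters the single-link distance is the smallest of these and the complete-link distance is the largest, with the average-link value in between.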

Question 16

Q

What is the time complexity of all HAC methods without utilising a heap?

Answer

A

O(dn^2 + n^3)
O(dn^2) to build the distance matrix, then n - 1 merges that each scan O(n^2) cluster pairs
d - number of features per data point
n - number of data points

Question 17

Q

What is the time complexity of all HAC methods, utilising a heap?

Answer

A

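O(dn^2 + n^2 log(n))
O(dn^2) to build the distance matrix, then O(n^2 log(n)) for the merges when cluster pairs are kept in a heap
d - number of features per data point
n - number of data points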

(17 cards)