Clustering Flashcards

1
Q

What are some applications of clustering?

A

Image segmentation
Social network analysis
Bioinformatics

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is clustering?

A

The process of grouping a set of instances into classes of similar instances

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What are the two types of clustering algorithms?

A

Partitional algorithms
Hierarchical algorithms

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What are some considerations we need to make for clustering?

A

How to measure similarity between two things
What approach to take
How many clusters

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is a centroid?

A

A point that is considered to be the centre of a cluster

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What are the steps for the k-means algorithm?

A
  1. Compute dist from all data to all centroids
  2. For each data point, assign it to whichever centroid is closer
  3. for each centroid, compute the mean of all points assigned to it
  4. Replace the centroids with the new averages
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Can the final result of k-means clustering be affected by the initial centroid choice?

A

Yes - some can result in a poor convergence rate, or convergence to sub optimal clusterings

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is the time complexity of k-means clustering?

A

O(iknd)

i - iterations
k - num of centroids
n - num of data points
d - number of features of the data points

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What are the limitations of k-means clustering?

A
  • Must choose param k in advance
  • Data must be numerical and must be able to be compared via a suitable measure
  • Algorithm sensitive to outliers that do not belong in a cluster
  • works best on data with spherical clusters
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

How do we measure cluster validity?

A
  • High inter (between) cluster distances
  • Low intra (within) cluster distances
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What are the two kinds of hierarchical clustering algorithms?

A
  • Agglomerative (bottom up)
  • Divisive (top down)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

How does HAC work?

A
  • Start with each point being a cluster
  • Merge the closest points
  • Eventually all points belong to the same cluster
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

How does HDC work?

A
  • Start with all points belonging to one cluster
  • Split furthest points
  • Eventually each node is an individual cluster
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What are the steps for HAC?

A

Compute the distance matrix (= distance between any 2 data points)
Let each data point be a cluster
Repeat:
Merge the two (or more) closest clusters
Update the distance matrix
Until only a single cluster remains

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What are the different ways to define closest clusters?

A
  • Single link : dist of closest
  • Complete link : dist of furthest
  • Centroid : dist of cog
  • Average link : average dist of data points
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is the time complexity of all HAC methods without utilising a heap?

A

O(dn^2)
d - number of features per data point
n - num of individual data points

17
Q

What is the time complexity of all HAC methods, utilising a heap?

A

O(dn^2 log(n) )
d - number of features per data point
n - num of individual data points