Clustering Flashcards

1
Q

Clustering

A

A data mining technique that involves the grouping of numerous points.
It is a form on unsupervised learning.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What are the types of clusters?

2

A

1) Hard clusters - a type of clustering where each point belongs to a specific cluster completely or not at all.
2) Soft clusters - a type of clustering where instead of assigning each point to a specific cluster, each point is assigned a probabilty to be placed in a cluster.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What are some of the ways to calculate the distance between members in a cluster?

3

A

1) Set of vectors - measures similarity using the cosine distance
2) Sets of sets - measure similarity by using the Jaccard distance
3) Sets of points - measures the similarity using Euclidean distance

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

k-means clustering

A

Form of unsupervised learning.

Assume k is our number of clusters and we are in Euclidean space.

Algorithm:
1) For each point, place it in the cluster whose current centroid is nearest.
2) After all points are assigned update the locations of centroids for each cluster.
3) Reassign points to the nearest centroid
4) Repeat steps 2 and 3 until convergence happens (stage where points do not move)

Note that we choose k by choosing the kink in the graph. Where x axis is number of k’s and y value is average distance to centroid.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What are the other methods of hierachical clustering?

2

A

1) Agglomerative (bottom up) - each point is it’s own cluster from there you repeatedly combine the nearest points.
2) Divisive (top down) - start at the top and split recursively until you cannot split anymore.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is a dendogram?

A

Slide 225

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

How do you calcualte cohesion?

A

It is WSS or within cluster sum of squares.
WSS = SSE= SUM(data points - centroids)^2
BSS = SUM(number of points cluster * distance(centroid, total sample mean))

BSS means the sum of distances between the centroids and the total sample mean multiplied by the number of points within each cluster

Formula (conceptually):

BCSS=∑𝑘=1𝐾𝑛𝑘⋅∥𝜇𝑘−𝜇∥^2

Where:
𝑛𝑘 = number of points in cluster k
k = centroid of cluster
μ = overall mean of all data points.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Suppose we have the following points:
|—-m1—-|———m———|—-m2—-|
1————2———3———4————-5

Do these calculations for k = 1, k = 2 and k = 4. The WSS and BSS calculation.

A

K = 1
WSS = (1-3)^2+(2-3)^2+(4-3)^2+(5-3)^2 = 10
BSS = 4 * (3-3)^2 = 0
Total = 10 + 0 = 10

K = 2
WSS = (1-1.5)^2+(2-1.5)^2+(4-4.5)^2+(5-4.5)^2 = 1
BSS = 2 * (3-1.5)^2 + 2 * (4.5-3)^2= 9
Total = 1 + 9 = 10

K = 4
WSS = (1-1)^2+(2-2)^2+(4-4)^2+(5-5)^2 = 0
BSS = 1 * (1-3)^2 + 1 * (2-3)^2 + 1 * (4-3)^2 + 1 * (5-3)^2 = 10
Total = 10 + 0 = 10

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Silhouette Coefficient

A

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The Silhouette coefficient is a value between -1 and 1, where higher values indicate a better clustering.

a(i) = avg distance of i to other point same cluster
b(i) = avg distance to nearest other cluster

S(i) = b(i) - a(i) / max{b(i), a(i)}

https://medium.com/@MrBam44/how-to-evaluate-the-performance-of-clustering-algorithms-3ba29cad8c03

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Bradley-Fayyed-Reina (BFR)

A

Designed to handle very large datasets.
Assumes that clusters are normally distributed.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly