Clustering Flashcards
Clustering
A data mining technique that involves the grouping of numerous points.
It is a form on unsupervised learning.
What are the types of clusters?
2
1) Hard clusters - a type of clustering where each point belongs to a specific cluster completely or not at all.
2) Soft clusters - a type of clustering where instead of assigning each point to a specific cluster, each point is assigned a probabilty to be placed in a cluster.
What are some of the ways to calculate the distance between members in a cluster?
3
1) Set of vectors - measures similarity using the cosine distance
2) Sets of sets - measure similarity by using the Jaccard distance
3) Sets of points - measures the similarity using Euclidean distance
k-means clustering
Form of unsupervised learning.
Assume k is our number of clusters and we are in Euclidean space.
Algorithm:
1) For each point, place it in the cluster whose current centroid is nearest.
2) After all points are assigned update the locations of centroids for each cluster.
3) Reassign points to the nearest centroid
4) Repeat steps 2 and 3 until convergence happens (stage where points do not move)
Note that we choose k by choosing the kink in the graph. Where x axis is number of k’s and y value is average distance to centroid.
What are the other methods of hierachical clustering?
2
1) Agglomerative (bottom up) - each point is it’s own cluster from there you repeatedly combine the nearest points.
2) Divisive (top down) - start at the top and split recursively until you cannot split anymore.
What is a dendogram?
Slide 225
How do you calcualte cohesion?
It is WSS or within cluster sum of squares.
WSS = SSE= SUM(data points - centroids)^2
BSS = SUM(number of points cluster * distance(centroid, total sample mean))
BSS means the sum of distances between the centroids and the total sample mean multiplied by the number of points within each cluster
Formula (conceptually):
BCSS=∑𝑘=1𝐾𝑛𝑘⋅∥𝜇𝑘−𝜇∥^2
Where:
𝑛𝑘 = number of points in cluster k
k = centroid of cluster
μ = overall mean of all data points.
Suppose we have the following points:
|—-m1—-|———m———|—-m2—-|
1————2———3———4————-5
Do these calculations for k = 1, k = 2 and k = 4. The WSS and BSS calculation.
K = 1
WSS = (1-3)^2+(2-3)^2+(4-3)^2+(5-3)^2 = 10
BSS = 4 * (3-3)^2 = 0
Total = 10 + 0 = 10
K = 2
WSS = (1-1.5)^2+(2-1.5)^2+(4-4.5)^2+(5-4.5)^2 = 1
BSS = 2 * (3-1.5)^2 + 2 * (4.5-3)^2= 9
Total = 1 + 9 = 10
K = 4
WSS = (1-1)^2+(2-2)^2+(4-4)^2+(5-5)^2 = 0
BSS = 1 * (1-3)^2 + 1 * (2-3)^2 + 1 * (4-3)^2 + 1 * (5-3)^2 = 10
Total = 10 + 0 = 10
Silhouette Coefficient
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The Silhouette coefficient is a value between -1 and 1, where higher values indicate a better clustering.
a(i) = avg distance of i to other point same cluster
b(i) = avg distance to nearest other cluster
S(i) = b(i) - a(i) / max{b(i), a(i)}
https://medium.com/@MrBam44/how-to-evaluate-the-performance-of-clustering-algorithms-3ba29cad8c03
Bradley-Fayyed-Reina (BFR)
Designed to handle very large datasets.
Assumes that clusters are normally distributed.