Representative Based Flashcards
Representative Based
N instances, k clusters, partition C of N in K
centroid_i = (1/n_i) * sum(xj)
where n_i = |Ci|
Brute Force
All possible clustering
Select best one
O(k^N/k!)
K-means
Greedy iterative approach SSE = sum(sum(||xj-centroid_i||^2)) Cā = arg min SSE Most widely know rep. Complexity O(nkd), centroid recompute O(nd)
K-means Centroid init
1) Pick points far away
2) First point at random. add point larga distance
3) Hierarchically so k clusters/ Pick point of each cluster
4) more than K centroids and then merge
P = K! / K^K
Bisecting
one cluster, then divide, then divide.
Updating Center incrementally
After each assignment
More expensive
Pre processing
Normalize data
Eliminate outliers
Post processing
Eliminate small clusters
Split loose clusters
Merge clusters closer
Problems of Kmeans
Densities Non globular shapes Sizes Need to specify k Sensitive to outliers
BFR Algorithm
Very large data sets
Clusters normally distributed around centroid
Axis aligned ellipses
O(clusters) and not O(data)
BFR Steps
K random points
Small random sample and cluster optimally
Take sample, pick random point, and then k-1 more points, as far from previous
Discard Set, Compression Set, Retained Set
Mean Shift Clustering
Iterative, non parametric algorithm It searches for the mode Find densest region Number of modes gives number of clusters Can handle arbitrarily shaped clusters Robust to initializations O(T*n^2) expensive
Mean Shift Pseudo
1) choose bandwidthh
2) initial location of search window
3) compute mean location
4) center search window point 3
5) repeat 3 and 4
Expectation Max - Clustering
soft assignment. each point has a probability
Mean vector u_i, covariance matrix E_i
phi = {u_i * E_i * P(C_i)}
Goal: maximize phi
Density Based
Arbitrary Shape
Handle noise
One Scan
Neighborhood within a radius E
Core object: if E-neig contains at least minpts
Direct density reachable from y if x is within e-neigh
Density Reachable, Density Connected.