Clustering Flashcards
Basic Clustering problem
Given a set of objects X, partition them into groups.
Inter-object distance is symmetric: D(x, y) = D(y, x)
Hierarchical Agglomerative clustering
Starts by treating each observation as its own cluster, then iteratively merges the two closest clusters (n − k merges to go from n clusters to k) until all the data points are merged into a single cluster. Dendrograms are used to represent hierarchical clustering results.
Clusters are merged based on the distance between them; to calculate the distance between clusters, we use different types of linkages.
What is the linkage criterion?
It determines the distance between sets of observations as a function of the pairwise distances between observations.
Single linkage -> minimum distance between members of the two clusters (tends to produce long chains)
Complete linkage -> maximum distance between members (tends to produce spherical clusters with consistent diameter)
Average linkage -> the average of all pairwise distances (less affected by outliers)
Centroid linkage -> the distance between the clusters' centroids (in Euclidean space)
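A minimal sketch of the four linkages using SciPy, assuming scipy and numpy are available (the two toy blobs are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),   # one blob near the origin
               rng.normal(5, 1, (20, 2))])  # a second, well-separated blob

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # (n-1) x 4 merge history
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram at 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes per linkage
```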
What is the running time of SLC (single-linkage clustering)?
- It is deterministic; it doesn't require randomized optimization.
- If the distances represent edge lengths of a graph, SLC is the same as the minimum spanning tree algorithm.
- The naive runtime is O(N^3).
- It could be made faster by storing the pairwise distances.
Repeat n − k times (O(N) merges):
look at all O(N^2) distances to find the closest pair of points
that have different labels, and merge their clusters (see the sketch below).
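A minimal numpy sketch of that naive O(N^3) loop (the function name slc is my own):

```python
import numpy as np

def slc(X, k):
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    labels = np.arange(n)                    # every point starts as its own cluster
    for _ in range(n - k):                   # n - k merges leave k clusters
        best, pair = np.inf, None
        for i in range(n):                   # scan all O(N^2) pairs for the
            for j in range(i + 1, n):        # closest pair with different labels
                if labels[i] != labels[j] and dist[i, j] < best:
                    best, pair = dist[i, j], (i, j)
        i, j = pair
        labels[labels == labels[j]] = labels[i]   # merge j's cluster into i's
    return labels
```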
Could we do SLC in linear time?
No. It can be a little better than O(N^3) for very small or very large k, and we can reduce the time to O(N^2 log N) by using a priority queue.
Issues with SLC?
Premature merging of groups that happen to contain a close pair, even if those groups are quite dissimilar overall (the chaining phenomenon).
It also doesn't work well with large datasets, because of the O(N^3) runtime.
How does k-means work?
1) Pick k centers at random
2) Each center claims its closest points
3) Re-compute the centers by averaging the clustered points
4) Repeat until convergence (see the sketch below)
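A minimal numpy sketch of those four steps (Lloyd's algorithm); the function name kmeans is my own, and it assumes no cluster goes empty:

```python
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    centers = X[rng.choice(len(X), k, replace=False)]  # 1) pick k centers at random
    while True:
        d = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)                      # 2) each center claims its closest points
        centers_new = np.array([X[labels == i].mean(axis=0)  # 3) re-compute the centers by
                                for i in range(k)])           #    averaging (no empty clusters)
        if np.allclose(centers_new, centers):          # 4) repeat until converged
            return labels, centers
        centers = centers_new
```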
Does the center in k-means need to be a point in the data?
And how do we pick the initial k points?
Not necessarily.
1) Sampling: sample the data, cluster the sample with hierarchical clustering, then pick a point from each cluster.
2) Pick a dispersed set of points: the first point at random; afterwards, pick the point whose minimum distance to the points already chosen is as large as possible (sketched below).
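A sketch of the dispersed-points heuristic (2), often called farthest-first traversal; the function name is my own:

```python
import numpy as np

def farthest_first(X, k, rng=np.random.default_rng(0)):
    centers = [X[rng.integers(len(X))]]          # first point at random
    for _ in range(k - 1):
        # distance from each point to its nearest already-chosen center
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])            # pick the point whose min distance is largest
    return np.array(centers)
```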
How is k-means optimized?
It behaves like hill climbing: it evaluates the error score and moves to a neighboring configuration with a better score.
*) The error can only go down or stay the same.
How to pick the right k for k-means?
Use the elbow curve: plot the average distance to the centroid against k, and pick the smallest k after which the curve stops improving quickly (see the sketch below).
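A sketch of the elbow curve using scikit-learn, assuming it is installed (inertia_ is the total squared distance from points to their closest centroid):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))
for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))  # look for the k where the drop flattens out
```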
What is the complexity of k-means?
*) In each round, we have to examine each input point exactly once to find its closest centroid.
- ) Each round is O(KN) for N points and K clusters.
- ) But the number of rounds to convergence can be very large, up to O(K^N).
In k-means, why does moving the center never make the error go up?
Because the average is the best way to represent a set of points (the average minimizes squared error; see the one-line derivation below).
The error is monotonically non-increasing!
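A one-line derivation of that claim, in LaTeX: setting the gradient of the squared error to zero shows the minimizer is exactly the mean.

```latex
c^* = \arg\min_c \sum_{i=1}^{n} \lVert x_i - c \rVert^2,
\qquad
\nabla_c \sum_i \lVert x_i - c \rVert^2 = -2 \sum_i (x_i - c) = 0
\;\Rightarrow\;
c^* = \frac{1}{n} \sum_{i=1}^{n} x_i
```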
How does k-means converge?
There is a finite number of configurations: there is a finite number of objects, each object can take a finite number of labels, and once you've chosen a label for every object, the centers are defined deterministically from that.
You have to have some way to break ties; then you never return to an old configuration, so eventually you run out of configurations.
There are K^N configurations.
Finite number of configurations -> never getting worse in the error metric -> some way of breaking ties -> you have to stop. This is how you get convergence.
Can k-means get stuck?
Yes, depending on the first random points chosen (a local optimum).
We can avoid that by repeating the run (random restarts; see the sketch below).
There are also smarter ways to choose the starting points (e.g., the dispersed-points heuristic above).
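A sketch of random restarts, reusing the kmeans sketch from above: run it several times and keep the run with the lowest total squared error.

```python
import numpy as np

def kmeans_restarts(X, k, restarts=10):
    best_err, best = np.inf, None
    for seed in range(restarts):
        labels, centers = kmeans(X, k, rng=np.random.default_rng(seed))
        err = ((X - centers[labels]) ** 2).sum()  # total squared error of this run
        if err < best_err:
            best_err, best = err, (labels, centers)
    return best
```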
Is k-means hard or soft clustering?
Hard clustering: each point is assigned to exactly one cluster.