K-Means Flashcards
Define monothetic cluster
Members have some common property
Define polythetic cluster
Cluster members are similar to each other
Define hard clustering
Clusters do not overlap
Define soft clustering
Clusters may overlap
Define flat clustering
A set of clusters without any explicit structure relating the clusters to each other
Define hierarchical clustering
Produces a hierarchy of clusters
K-D trees, monothetic or polythetic?
Monothetic, few cuts describe each member of the region
K-D trees, hard or soft boundaries?
Hard
K-D trees, flat or hierarchical?
Hierarchical, we can cut the tree at any depth
K-means, monothetic or polythetic?
Polythetic
K-means, hard or soft boundaries?
Hard
K-means, flat or hierarchical?
Flat
Gaussian mixtures, monothetic or polythetic?
Polythetic
Gaussian mixtures, hard or soft boundaries?
Soft
Gaussian mixtures, flat or hierarchical?
Flat
Agglomerative clustering, monothetic or polythetic?
Polythetic
Agglomerative clustering, hard or soft boundaries?
Hard
Agglomerative clustering, flat or hierarchical?
Hierarchical
What are some common use cases for K-means? (2)
- Discovering classes (unsupervised)
- Dimensionality reduction
How can k-means be used for dimensionality reduction?
Run k-means, replace features with a cluster number
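A minimal NumPy sketch of this reduction, assuming centroids have already been fitted: each d-dimensional feature vector is replaced by the index of its nearest centroid.

```python
import numpy as np

def reduce_to_cluster_ids(X, centroids):
    # Distance from every point to every centroid: shape (n, k)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # One integer (nearest-centroid index) per point
    return dists.argmin(axis=1)

X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(reduce_to_cluster_ids(X, centroids))  # → [0 0 1]
```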
K-means clustering algorithm
- Place centroids c1, …, ck randomly
- Repeat until convergence:
  - For each point xi:
    - Find the nearest centroid cj
    - Assign point xi to cluster j
  - For each cluster j, move the centroid cj to be closest to all the data points in the cluster (for Euclidean distance this is the mean)
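The loop above can be sketched in NumPy; random initialization on k distinct data points is one common choice (assumed here), and an empty cluster keeps its old centroid.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Place centroids "randomly": here, on k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point x_i to its nearest centroid c_j
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, labels = kmeans(X, 2)
```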
What does it mean that K-means converges to a local minimum?
Different starting points can produce different clusters
What's the variance for K-means?
The sum of distances from each point to its assigned centroid (commonly the squared Euclidean distance)
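A sketch of this variance, assuming the standard squared-Euclidean form of the objective:

```python
import numpy as np

def kmeans_variance(X, labels, centroids):
    # Sum over all points of the squared distance to their assigned centroid
    return float(np.sum((X - centroids[labels]) ** 2))

X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[0.0, 1.0], [4.0, 0.0]])
print(kmeans_variance(X, labels, centroids))  # → 2.0
```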
What is a scree plot?
A plot of the variance V against the number of clusters K; V drops steeply at first, then flattens
How can you pick a good value of K for k-means? (2)
- class labels may suggest a value (digits recognition - 10 digits)
- optimize V visually from a scree plot (where mountain ends and rubble begins)
How can we extrinsically evaluate a clustering algorithm?
Use it as part of another problem (e.g. removing outliers for digit recognition), and see if it helped
How can we intrinsically evaluate a clustering algorithm (with reference clusters)?
- Align reference clusters Rj with system produced clusters Ci
- Measure the accuracy
How do we measure the accuracy in intrinsic clustering evaluation (with reference clusters)?
The sum of overlaps between each aligned pair of system and reference clusters, divided by the total number of instances
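A minimal sketch of this measure, assuming each system cluster Ci is already aligned with reference cluster Ri (same list index):

```python
def clustering_accuracy(system_clusters, reference_clusters, n_instances):
    # Sum of overlaps |C_i ∩ R_i| over aligned pairs, divided by instance count
    overlap = sum(len(set(c) & set(r))
                  for c, r in zip(system_clusters, reference_clusters))
    return overlap / n_instances

system = [[0, 1, 2], [3, 4]]       # instance ids per system cluster
reference = [[0, 1], [2, 3, 4]]    # aligned reference clusters
print(clustering_accuracy(system, reference, 5))  # → 0.8
```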
How can we align a reference cluster with a system cluster?
- Pick the pair of reference and system clusters with the maximum overlap
- Greedily reassign clusters until each reference cluster has a unique system cluster
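The greedy alignment above can be sketched as: repeatedly pick the (reference, system) pair with maximum overlap among still-unassigned clusters.

```python
def greedy_align(reference_clusters, system_clusters):
    ref_sets = [set(r) for r in reference_clusters]
    sys_sets = [set(c) for c in system_clusters]
    unassigned_ref = set(range(len(ref_sets)))
    unassigned_sys = set(range(len(sys_sets)))
    alignment = {}
    while unassigned_ref and unassigned_sys:
        # Pick the remaining pair with the maximum overlap
        j, i = max(((j, i) for j in unassigned_ref for i in unassigned_sys),
                   key=lambda p: len(ref_sets[p[0]] & sys_sets[p[1]]))
        alignment[j] = i
        unassigned_ref.remove(j)
        unassigned_sys.remove(i)
    return alignment  # reference index -> system index

reference = [[0, 1], [2, 3, 4]]
system = [[2, 3, 4], [0, 1]]
print(greedy_align(reference, system))  # → {1: 0, 0: 1}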
How can we intrinsically evaluate clusters with humans?
- Sample pairs
- Ask humans if they should be in the same cluster
- Count errors and compute accuracy
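A sketch of this pairwise evaluation; `human_says_same` is a stand-in for real annotator judgments (simulated below, not part of the original cards):

```python
import random

def pairwise_accuracy(labels, human_says_same, n_pairs=1000, seed=0):
    rng = random.Random(seed)
    n = len(labels)
    correct = 0
    for _ in range(n_pairs):
        # Sample a pair and compare the system's same-cluster decision
        # to the human judgment
        a, b = rng.randrange(n), rng.randrange(n)
        system_same = labels[a] == labels[b]
        correct += system_same == human_says_same(a, b)
    return correct / n_pairs

labels = [0, 0, 1, 1]                          # system cluster ids
human = lambda a, b: (a < 2) == (b < 2)        # simulated annotator
print(pairwise_accuracy(labels, human))        # → 1.0
```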
What is the advantage of intrinsic evaluation with humans vs reference clusters?
- Does not require cluster alignment strategy
- Can handle overlapping classes
How can we use K-Means as part of image recognition?
- Split image into regions
- Compute statistics of regions
- Use K-Means to cluster regions
- Put clusters together like bag-of-words { 4 * “C27”, 15 * “C44”, … }
- Use any algorithm on the flat representation
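The bag-of-clusters step above is just word counting over region cluster ids, e.g.:

```python
from collections import Counter

# Cluster id assigned to each image region (hypothetical ids)
region_cluster_ids = ["C27", "C44", "C44", "C27", "C44", "C3"]
# Count ids like words in bag-of-words: a flat, fixed-vocabulary
# representation any classifier can consume
bag = Counter(region_cluster_ids)
print(bag)  # → Counter({'C44': 3, 'C27': 2, 'C3': 1})
```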
What does K-Means minimize?
The aggregate intra-cluster distance (the variance V)