Clustering Flashcards
Clustering
- Grouping similar objects/features into classes (clusters)
- Ill-posed: the definition of a group is vague and arbitrary (clustering may rely on less important features, and there is often more than one valid way of classifying)
- Most common form of unsupervised learning
- Definition:
- input: a set of objects, a measure of similarity, and k, the desired number of clusters
- output: an assignment function from objects to one of the k clusters; no cluster is empty, and there is no order between clusters
- Important choices:
- representation of the data, crucial for results (preprocessing, similarity/distance measure)
- how many clusters (avoid trivial clusters: ones that are too big or too small are not useful)
Clustering as a search problem
- The goal is often to optimize an objective function, i.e., to search over a solution space
- roughly K^n/K! possible partitions of n objects into K clusters (see the sketch below)
- Many local minima in the objective function may result in non-optimal partitions
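The K^n/K! count approximates the exact number of distinct partitions of n objects into K non-empty clusters, the Stirling number of the second kind; a minimal sketch comparing the two, with illustrative values of n and K:

```python
from math import comb, factorial

def stirling2(n, k):
    # Exact number of partitions of n objects into k non-empty clusters
    # (Stirling number of the second kind, via inclusion-exclusion).
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1)) // factorial(k)

n, k = 10, 3
print(stirling2(n, k))        # 9330 exact partitions
print(k ** n / factorial(k))  # 9841.5, the K^n/K! approximation
```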
Assessment of clustering
- Internal criteria: use the objects' similarity measure; intra-cluster similarity should be high and inter-cluster similarity should be low [distance from the centroid, …]
- External criteria: a ground truth is given to measure differences; quality is the ability to recognize some or all of the hidden patterns in the data [purity, Rand index, entropy, …]
Purity
*external evaluation
*ratio between the number of elements of the dominant class in a cluster and the cardinality of the cluster
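A minimal sketch of this measure, assuming integer-encoded ground-truth and cluster labels (function and variable names are illustrative); overall purity sums the dominant-class counts over all clusters:

```python
import numpy as np

def purity(labels_true, labels_pred):
    # For each cluster, count the elements of its dominant true class,
    # then divide the total by the number of examples.
    dominant = sum(
        np.bincount(labels_true[labels_pred == c]).max()
        for c in np.unique(labels_pred)
    )
    return dominant / len(labels_true)

labels_true = np.array([0, 0, 1, 1, 2])
labels_pred = np.array([0, 0, 0, 1, 1])
print(purity(labels_true, labels_pred))  # 0.6
```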
RandIndex
- external evaluation
- considers, for each pair of examples, whether they have been placed consistently with the ground truth (together or apart in both assignments)
- similar to classification accuracy
- RI = (TP + TN) / (TP + FP + TN + FN)
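A minimal pairwise sketch, assuming the same integer-label encoding as above; a pair counts as TP when both assignments put it in the same cluster and as TN when both keep it apart:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    n = len(labels_true)
    # A pair is correct (TP or TN) when both assignments agree on
    # whether its two examples belong together.
    correct = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in combinations(range(n), 2)
    )
    return correct / (n * (n - 1) // 2)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))  # 5/6 ≈ 0.83
```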
Clustering algorithms
*Partitioning algorithms: start with an initial partition and refine it according to a criterion
- heuristic: K-means clustering
- model-based clustering
*Hierarchical algorithms
- Bottom-up or agglomerative
- Top-down or divisive
K-means
*For each cluster, we minimize the average distance between the examples and the center of the cluster (centroid)
*examples are assumed to be real-valued vectors
1. Generate K points, the initial centroids
2. Assign each instance to a cluster based on its similarity/distance to the centroids of the current clusters
3. Recalculate the position of each centroid as the mean of all the examples belonging to its cluster
4. Repeat steps 2 and 3 until the centroids stabilize (see the sketch below)
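A minimal NumPy sketch of the four steps, assuming Euclidean distance and initialization with K random examples (names and defaults here are illustrative, not a reference implementation):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: K initial centroids, drawn from the examples themselves.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each example to the nearest current centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (keeping the old centroid if a cluster went empty).
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids stabilize.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign, centroids
```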
Hierarchical algorithms
- useful when the optimal number of clusters is not known
- build a dendrogram, a tree-based taxonomy
- recursive application of a clustering algorithm
- aggregation of clusters is the more popular approach
Agglomerative Hierarchical Clustering
*Starts with one cluster per example and merges clusters gradually until a single cluster is obtained
*the history of merges forms a dendrogram
*How to measure the distance between clusters (see the sketch below):
- Single-link: similarity between the most similar examples in the two clusters; O(n^2); prone to a chaining effect
- Complete-link: similarity between the most distant examples in the two clusters; O(n^2 log n); sensitive to outliers
- Centroid: similarity between the centroids of the clusters; O(n^2 log n); non-monotonic
- Average-link: mean similarity between pairs of examples across the two clusters; O(n^2 log n)
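A minimal sketch with SciPy, where the method argument selects among the four inter-cluster distance criteria above (the data here is random, for illustration only):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(20, 2))
# method can be 'single', 'complete', 'centroid', or 'average'; each
# row of Z records one merge: the two clusters joined, their distance,
# and the size of the resulting cluster (the dendrogram history).
Z = linkage(X, method="average")
```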
Dendrogram
- y-axis: similarity between samples/clusters
- merges are monotone (similarity decreases with each merge)
- a cut at a given height of the dendrogram corresponds to a flat clustering (see the sketch below)
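A minimal sketch of cutting a dendrogram with SciPy (the threshold t and the data are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(12, 2))
Z = linkage(X, method="average")
# Cutting at height t undoes every merge above t; the surviving
# subtrees form the flat clustering at that level.
labels = fcluster(Z, t=1.5, criterion="distance")
```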