Clustering Flashcards

1
Q

Clustering

A
  • Grouping similar objects/features into classes (clusters)
  • Ill-posed problem: the notion of a "group" is vague and somewhat arbitrary (clustering may rely on less important features, and there is often more than one valid way to classify)
  • Most common form of unsupervised learning
  • Definition:
  • a set of objects
  • a measure of similarity
  • k, the desired number of clusters
  • output: an assignment function from objects to one of the k clusters; no cluster is empty; there is no order between clusters
  • Important choices:
  • representation of the data, crucial for the results (preprocessing, similarity/distance measure)
  • how many clusters (avoid trivial clusters; too big or too small are not useful)
2
Q

Clustering as search problem

A
  • The goal is often to optimize an objective function: clustering is a search over the space of possible solutions
  • Roughly K^n / K! possible clusterings of n objects into K clusters
  • The objective function may have many local minima, so the search can end in a non-optimal partition
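The K^n / K! count above is an approximation; the exact number of ways to partition n objects into K non-empty (unlabelled) clusters is the Stirling number of the second kind. A minimal sketch comparing the two, using only the standard library:

```python
from math import factorial

def stirling2(n, k):
    """Stirling number of the second kind: exact count of partitions
    of n objects into k non-empty, unlabelled clusters, via the
    inclusion-exclusion formula S(n,k) = (1/k!) sum_j (-1)^j C(k,j) (k-j)^n."""
    total = sum(
        (-1) ** j
        * (factorial(k) // (factorial(j) * factorial(k - j)))  # C(k, j)
        * (k - j) ** n
        for j in range(k + 1)
    )
    return total // factorial(k)

n, k = 10, 3
exact = stirling2(n, k)            # 9330 partitions of 10 objects into 3 clusters
approx = k ** n / factorial(k)     # the K^n/K! approximation (slightly larger)
print(exact, approx)
```

Even for n = 10 and K = 3 there are thousands of candidate partitions, which is why exhaustive search is infeasible and iterative refinement gets stuck in local minima.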
3
Q

Assessment of clustering

A
  • Internal criteria: use the objects' similarity measure. Intra-cluster similarity should be high and inter-cluster similarity should be low [distance from the centroid, …]
  • External criteria: ground truth is given to measure differences. Quality is the ability to recognize some or all of the hidden patterns in the data [purity, Rand Index, entropy, …]
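A minimal sketch of an internal criterion on toy 2-D data (the point sets and helper names here are illustrative, not from the card): intra-cluster spread, measured as squared distance to each cluster's centroid, should be small relative to the separation between centroids.

```python
def centroid(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

# Two toy clusters: tight groups that are far apart.
c1 = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
c2 = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
m1, m2 = centroid(c1), centroid(c2)

# Intra-cluster criterion: total squared distance to the own centroid (low is good).
intra = sum(sq_dist(p, m1) for p in c1) + sum(sq_dist(p, m2) for p in c2)
# Inter-cluster criterion: separation between the centroids (high is good).
inter = sq_dist(m1, m2)
print(intra, inter)
```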
4
Q

Purity

A

  • external evaluation
  • ratio between the number of elements of the dominant class in a cluster and the cardinality of the cluster
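The definition above can be sketched directly; overall purity is the sum of the dominant-class counts divided by the total number of points (the toy labels below are made up for illustration):

```python
from collections import Counter

def purity(clusters):
    """Overall purity: sum over clusters of the count of the dominant
    ground-truth class, divided by the total number of points.
    `clusters` is a list of lists of ground-truth labels."""
    dominant = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    total = sum(len(c) for c in clusters)
    return dominant / total

# Three clusters with known ground-truth labels.
clusters = [["x", "x", "x", "o"], ["o", "o", "x"], ["d", "d", "o"]]
print(purity(clusters))  # (3 + 2 + 2) / 10 = 0.7
```

Note that purity is trivially maximized by singleton clusters (each cluster of one element is 100% pure), so it is usually reported together with the number of clusters.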

5
Q

RandIndex

A
  • external evaluation
  • considers, for each pair of examples, whether they have been placed in the clusters consistently with the ground truth
  • similar to classification accuracy
  • RI = (TP + TN) / (TP + FP + TN + FN)
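A minimal sketch of the pairwise definition: a pair counts as a true positive if it shares a cluster in both the prediction and the ground truth, and as a true negative if it is separated in both (the example assignments are invented for illustration):

```python
from itertools import combinations

def rand_index(pred, truth):
    """Rand Index: fraction of example pairs on which the predicted
    clustering and the ground truth agree (together in both, or
    separated in both)."""
    tp = tn = 0
    pairs = list(combinations(range(len(pred)), 2))
    for i, j in pairs:
        same_pred = pred[i] == pred[j]
        same_truth = truth[i] == truth[j]
        if same_pred and same_truth:
            tp += 1                      # true positive pair
        elif not same_pred and not same_truth:
            tn += 1                      # true negative pair
    return (tp + tn) / len(pairs)

pred = [0, 0, 1, 1]
truth = ["a", "a", "b", "a"]
print(rand_index(pred, truth))  # 3 of 6 pairs agree -> 0.5
```

Cluster labels themselves do not matter, only which examples end up together, so the clustering can use arbitrary identifiers.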
6
Q

Clustering algorithms

A

  • Partitioning algorithms: start with an initial partition and refine it according to a criterion
  - heuristic: K-means clustering
  - model-based clustering
  • Hierarchical algorithms
  - bottom-up or agglomerative
  - top-down or divisive

7
Q

K-means

A

  • For each cluster we minimize the mean distance between the examples and the center of the cluster (the centroid)
  1. Generate K points, the initial centroids
  2. Assign each instance to a cluster based on its similarity/distance to the centroids of the current clusters
  3. Recalculate the position of each centroid as the mean of all the examples belonging to its cluster
  4. Repeat steps 2 and 3 until the centroids stabilize
  • Examples are assumed to be real-valued vectors
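The steps above can be sketched in plain Python on 2-D points. As a simplification, the initial centroids (step 1) are passed in by the caller rather than generated randomly, so the run is reproducible:

```python
def kmeans(points, centroids, iters=100):
    """Plain K-means on 2-D points. `centroids` are the K initial
    centres (step 1); returns final centroids and cluster contents."""
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # Step 2: assign each point to its nearest current centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Step 3: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Step 4: stop when the centroids stabilize.
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, clusters = kmeans(pts, [(0.0, 0.0), (10.0, 10.0)])
print(cents)
```

The result depends heavily on the initial centroids; in practice the algorithm is restarted several times from different initializations and the run with the lowest objective is kept.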

8
Q

Hierarchical algorithms

A
  • useful when the optimal number of clusters is not known
  • builds a dendrogram, a tree-based taxonomy
  • recursive application of a clustering algorithm
  • aggregation of clusters (the more popular approach)
9
Q

Agglomerative Hierarchical Clustering

A

  • Starts with one cluster per example and merges clusters gradually until a single cluster is obtained
  • The history of merges forms a dendrogram
  • How to measure the distance between clusters:
  - Single-link: similarity of the most similar pair of examples across the two clusters; O(n^2); prone to a chaining effect
  - Complete-link: similarity of the most distant pair of examples across the two clusters; O(n^2 log n); sensitive to outliers
  - Centroid: similarity between the centroids of the clusters; O(n^2 log n); non-monotonic
  - Average-link: mean similarity between pairs of examples across the clusters; O(n^2 log n)
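A naive sketch of the bottom-up procedure with the single-link criterion (this brute-force version recomputes all pairwise cluster distances at every step, so it is slower than the complexities quoted above; it is meant only to show the merge loop):

```python
def single_link_agglomerative(points, k):
    """Bottom-up clustering: start with one cluster per 2-D point and
    repeatedly merge the closest pair of clusters (single-link:
    distance between the closest members) until k clusters remain."""
    clusters = [[p] for p in points]

    def link(a, b):
        # Single-link distance: minimum pairwise squared distance.
        return min((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                   for p in a for q in b)

    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

pts = [(0, 0), (0.5, 0), (5, 5), (5.5, 5)]
out = single_link_agglomerative(pts, 2)
print(out)  # two clusters of two nearby points each
```

Recording the sequence of merges (and the link distance at each merge) instead of stopping at k would yield the full dendrogram.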

10
Q

Dendrogram

A
  • y-axis: similarity between samples/clusters
  • merges are monotone (similarity decreases as we move up)
  • a cut at a given height of the dendrogram corresponds to a clustering