Clustering Flashcards
Clustering
- Grouping similar objects/features into classes (clusters)
- Ill-posed: the definition of a group is vague and arbitrary (clustering may rely on less important features, and there is often more than one valid way of classifying)
- Most common form of unsupervised learning
- Definition:
- input: a set of objects, a measure of similarity, and k, the desired number of clusters
- output: an assignment function from objects to one of the k clusters; no cluster is empty, and there is no order between clusters
- Important choices:
- representation of the data, crucial for results (preprocessing, similarity/distance measure)
- how many clusters (avoid trivial clusters: ones that are too big or too small are not useful)
Clustering as a search problem
- The goal is often to optimize an objective function, i.e., to search over a solution space
- roughly K^n/K! possible partitions of n objects into K clusters (see the sketch below)
- Many local minima in the objective function may result in non-optimal partitions
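The K^n/K! count approximates the exact number of distinct partitions of n objects into K non-empty clusters, the Stirling number of the second kind; a minimal sketch comparing the two, with illustrative values of n and K:

```python
from math import comb, factorial

def stirling2(n, k):
    # Exact number of partitions of n objects into k non-empty clusters
    # (Stirling number of the second kind, via inclusion-exclusion).
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1)) // factorial(k)

n, k = 10, 3
print(stirling2(n, k))        # 9330 exact partitions
print(k ** n / factorial(k))  # 9841.5, the K^n/K! approximation
```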
Assessment of clustering
- Internal criteria: use the objects' similarity measure; intra-cluster similarity should be high and inter-cluster similarity should be low [distance from the centroid, …]
- External criteria: a ground truth is given to measure differences; quality is the ability to recognize some or all of the hidden patterns in the data [purity, Rand index, entropy, …]
Purity
*external evaluation
*ratio between the number of elements of the dominant class in a cluster and the cardinality of the cluster
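A minimal sketch of this measure, assuming integer-encoded ground-truth and cluster labels (function and variable names are illustrative); overall purity sums the dominant-class counts over all clusters:

```python
import numpy as np

def purity(labels_true, labels_pred):
    # For each cluster, count the elements of its dominant true class,
    # then divide the total by the number of examples.
    dominant = sum(
        np.bincount(labels_true[labels_pred == c]).max()
        for c in np.unique(labels_pred)
    )
    return dominant / len(labels_true)

labels_true = np.array([0, 0, 1, 1, 2])
labels_pred = np.array([0, 0, 0, 1, 1])
print(purity(labels_true, labels_pred))  # 0.6
```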
RandIndex
- external evaluation
- considers, for each pair of examples, whether they have been placed consistently with the ground truth (together or apart in both assignments)
- similar to classification accuracy
- RI = (TP + TN) / (TP + FP + TN + FN)
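A minimal pairwise sketch, assuming the same integer-label encoding as above; a pair counts as TP when both assignments put it in the same cluster and as TN when both keep it apart:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    n = len(labels_true)
    # A pair is correct (TP or TN) when both assignments agree on
    # whether its two examples belong together.
    correct = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in combinations(range(n), 2)
    )
    return correct / (n * (n - 1) // 2)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))  # 5/6 ≈ 0.83
```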
Clustering algorithms
*Partitioning algorithms: start with an initial partition and refine it according to a criterion
- heuristic: K-means clustering
- model-based clustering
*Hierarchical algorithms
- Bottom-up or agglomerative
- Top-down or divisive
K-means
*For each cluster, we minimize the average distance between the examples and the center of the cluster (centroid)
*examples are assumed to be real-valued vectors
1. Generate K points, the initial centroids
2. Assign each instance to a cluster based on its similarity/distance to the centroids of the current clusters
3. Recalculate the position of each centroid as the mean of all the examples belonging to its cluster
4. Repeat steps 2 and 3 until the centroids stabilize (see the sketch below)
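A minimal NumPy sketch of the four steps, assuming Euclidean distance and initialization with K random examples (names and defaults here are illustrative, not a reference implementation):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: K initial centroids, drawn from the examples themselves.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each example to the nearest current centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (keeping the old centroid if a cluster went empty).
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids stabilize.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign, centroids
```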
Hierarchical algorithms
- useful when the optimal number of clusters is not known
- build a dendrogram, a tree-based taxonomy
- recursive application of a clustering algorithm
- aggregation of clusters is the more popular approach
Agglomerative Hierarchical Clustering
*Starts with one cluster per example and merges clusters gradually until a single cluster is obtained
*the history of merges forms a dendrogram
*How to measure the distance between clusters (see the sketch below):
- Single-link: similarity between the most similar examples in the two clusters; O(n^2); prone to a chaining effect
- Complete-link: similarity between the most distant examples in the two clusters; O(n^2 log n); sensitive to outliers
- Centroid: similarity between the centroids of the clusters; O(n^2 log n); non-monotonic
- Average-link: mean similarity between pairs of examples across the two clusters; O(n^2 log n)
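A minimal sketch with SciPy, where the method argument selects among the four inter-cluster distance criteria above (the data here is random, for illustration only):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(20, 2))
# method can be 'single', 'complete', 'centroid', or 'average'; each
# row of Z records one merge: the two clusters joined, their distance,
# and the size of the resulting cluster (the dendrogram history).
Z = linkage(X, method="average")
```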
Dendrogram
- y-axis: similarity between samples/clusters
- merges are monotone (similarity decreases with each merge)
- a cut at a given height of the dendrogram corresponds to a flat clustering (see the sketch below)
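A minimal sketch of cutting a dendrogram with SciPy (the threshold t and the data are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(12, 2))
Z = linkage(X, method="average")
# Cutting at height t undoes every merge above t; the surviving
# subtrees form the flat clustering at that level.
labels = fcluster(Z, t=1.5, criterion="distance")
```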