Clustering Flashcards
What is clustering?
Clustering is an unsupervised learning method. It aims to partition a dataset into disjoint subsets, where each subset is called a cluster. Each cluster potentially corresponds to a concept (category). Clustering algorithms are unaware of such concepts beforehand and are only responsible for creating the clusters. Clustering can be used to identify the inherent structure of the data, and it also serves as a pre-processing technique for other learning tasks such as classification. (Do check the formal definition in the notes!)
What are the performance measures for clustering called? What does a good clustering look like?
Validity indices. We seek clusters with high intra-cluster similarity and low inter-cluster similarity.
What are External Validity Indices?
These compare clustering results against a reference or ground truth.
What are some example metrics of External Validity Indices?
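- Jaccard Coefficient (JC): Of all pairs of points placed together by the clustering or by the reference, the fraction placed together by both.
- Fowlkes and Mallows Index (FMI): Geometric mean of pairwise precision and pairwise recall.
- Rand Index (RI): Fraction of all pairs of points on which the clustering and the reference agree (together in both, or separated in both).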
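The three external indices above can all be computed from the same pair counts. The following is a minimal sketch, assuming cluster assignments are given as two equal-length label lists; the names `pair_counts`, `external_indices`, `pred`, and `ref` are illustrative and not from the notes. It also assumes at least one pair is grouped together by each of the clustering and the reference, otherwise a denominator is zero.

```python
from itertools import combinations
from math import sqrt


def pair_counts(pred, ref):
    """Classify every pair of points by whether it shares a cluster in pred and in ref."""
    a = b = c = d = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_ref = ref[i] == ref[j]
        if same_pred and same_ref:
            a += 1  # together in both
        elif same_pred:
            b += 1  # together only in the clustering
        elif same_ref:
            c += 1  # together only in the reference
        else:
            d += 1  # separated in both
    return a, b, c, d


def external_indices(pred, ref):
    """Return (JC, FMI, RI) for a clustering compared against a reference."""
    a, b, c, d = pair_counts(pred, ref)
    jc = a / (a + b + c)
    fmi = sqrt((a / (a + b)) * (a / (a + c)))  # geometric mean of pairwise precision and recall
    ri = (a + d) / (a + b + c + d)
    return jc, fmi, ri


pred = [0, 0, 1, 1, 1]  # hypothetical clustering result
ref = [0, 0, 0, 1, 1]   # hypothetical reference labels
print(external_indices(pred, ref))
```

For this toy input the pair counts are a=2, b=2, c=2, d=4, so JC = 2/6, FMI = 0.5, and RI = 0.6. All three indices lie in [0, 1], and larger values mean closer agreement with the reference.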
What are Internal Validity Indices?
These evaluate the quality of clustering without a reference or ground truth.
What are some example metrics of Internal Validity Indices?
- Average distance between points in a cluster.
- Cluster diameter (largest distance between two points in a cluster).
- Distance between the closest points of two clusters.
- Distance between cluster centroids.
- Davies-Bouldin Index (DBI): Lower DBI => better clustering. Evaluates intra-cluster similarity (how compact each cluster is) and inter-cluster separation (how distinct clusters are from one another).
- Dunn Index (DI): Higher DI => better clustering. DI evaluates the ratio between the minimum distance between clusters (inter-cluster separation) and the maximum diameter of any cluster (intra-cluster dispersion).
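The internal indices above can be sketched from the four basic quantities in the list. This is an illustrative sketch, not a reference implementation: it assumes Euclidean distance, represents each cluster as a list of coordinate tuples, and uses the average pairwise distance as the compactness term in DBI (some sources use the mean distance to the centroid instead). All function names are assumptions.

```python
from itertools import combinations
from math import dist


def centroid(cluster):
    """Coordinate-wise mean of the points in one cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def avg_distance(cluster):
    """Average distance over all pairs of points in one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs) if pairs else 0.0


def diameter(cluster):
    """Largest distance between two points in one cluster."""
    return max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def min_separation(c1, c2):
    """Distance between the closest points of two clusters."""
    return min(dist(p, q) for p in c1 for q in c2)


def davies_bouldin(clusters):
    """For each cluster take its worst (compactness sum / centroid distance) ratio, then average."""
    k = len(clusters)
    avgs = [avg_distance(c) for c in clusters]
    cents = [centroid(c) for c in clusters]
    worst = [
        max((avgs[i] + avgs[j]) / dist(cents[i], cents[j]) for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst) / k


def dunn(clusters):
    """Smallest inter-cluster separation divided by the largest cluster diameter."""
    sep = min(min_separation(a, b) for a, b in combinations(clusters, 2))
    return sep / max(diameter(c) for c in clusters)


far = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]   # well-separated toy clusters
near = [[(0, 0), (0, 1)], [(2, 0), (2, 1)]]    # same shapes, much closer together
print(davies_bouldin(far), dunn(far))
print(davies_bouldin(near), dunn(near))
```

On these toy clusters the well-separated pair gets DBI 0.2 and DI 10, while the closer pair gets DBI 1.0 and DI 2, which matches the rule that lower DBI and higher DI indicate better clustering.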
Not a flashcard. Go read DISTANCE CALCULATION!
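What are continuous attributes? What are categorical attributes? What are ordinal and non-ordinal attributes? What are non-metric distances?
Respectively: attributes with infinite domains; attributes with finite domains; ordinal attributes are categorical attributes with an ordering (e.g. {low, medium, high}), so they behave like continuous attributes and distances can be calculated directly from the attribute values, whereas non-ordinal attributes have no ordering (e.g. {aircraft, train, ship}), so distances cannot be calculated directly from the attribute values; distances that do not satisfy subadditivity (the triangle inequality).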
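A small sketch may help with the last two points. It assumes the Minkowski distance as the example metric and squared Euclidean distance as the example non-metric distance; neither choice comes from this card, and `size_rank` is a hypothetical ordinal encoding.

```python
def minkowski(x, y, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def squared_euclidean(x, y):
    """Squared Euclidean distance, used here as an example of a non-metric distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def satisfies_subadditivity(distance, x, y, z):
    """Check distance(x, y) <= distance(x, z) + distance(z, y) for one triple of points."""
    return distance(x, y) <= distance(x, z) + distance(z, y)


# Ordinal attribute: the ordering lets us compute a distance directly on the ranks.
size_rank = {"low": 1, "medium": 2, "high": 3}  # hypothetical encoding
print(minkowski((size_rank["low"],), (size_rank["high"],), p=1))

# Subadditivity check on the triple x=0, z=1, y=2.
x, y, z = (0,), (2,), (1,)
print(satisfies_subadditivity(minkowski, x, y, z))          # holds: 2 <= 1 + 1
print(satisfies_subadditivity(squared_euclidean, x, y, z))  # fails: 4 > 1 + 1
```

A single passing triple does not prove that a distance is a metric, but a single failing triple, as with squared Euclidean distance here, is enough to show that it is not.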
CHECK THE ALGORITHM FOR K-MEANS CLUSTERING
What is Prototype Clustering?
A family of clustering algorithms that assume the clustering structure can be represented by a set of prototypes, such as the cluster mean vectors in k-means.
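k-means is one prototype clustering algorithm: each prototype is the mean of the points assigned to it. The following is a minimal sketch, not the version in the notes. It assumes Euclidean distance, points given as coordinate tuples, and prototypes initialised by sampling k distinct points; the names `k_means`, `mean_point`, `max_iter`, and `seed` are illustrative.

```python
import random
from math import dist


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Alternate assignment and update steps until the prototypes stop changing."""
    rng = random.Random(seed)
    prototypes = rng.sample(points, k)  # initial prototypes: k distinct data points
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest prototype.
        labels = [min(range(k), key=lambda j: dist(p, prototypes[j])) for p in points]
        # Update step: each prototype becomes the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            updated.append(mean_point(members) if members else prototypes[j])
        if updated == prototypes:  # converged: no prototype moved
            break
        prototypes = updated
    return labels, prototypes


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]  # two toy groups
labels, prototypes = k_means(points, k=2)
print(labels, prototypes)
```

Which group receives label 0 depends on the random initialisation, so only the grouping itself is meaningful. On less cleanly separated data the result can also depend on the initial prototypes, which is why k-means is often run several times with different seeds.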