Chapter 10 - Clustering Flashcards
Clustering - overview and types we discussed in class
unsupervised learning, used to uncover latent structure in data. we assign a cluster label to each sample in the data matrix, but this label is not an output variable. two questions to ask before clustering: should we scale the features first? does Euclidean distance capture the dissimilarity we care about? types - K-means, hierarchical
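A minimal sketch of why the scaling question matters: with unscaled features, the variable measured in larger units dominates Euclidean distance. The height/weight numbers below are made up for illustration (numpy only).

```python
import numpy as np

# Three people: [height in cm, weight in kg] (made-up values).
X = np.array([[170.0, 60.0],
              [180.0, 61.0],
              [171.0, 90.0]])

# Raw Euclidean distances from person 0: the 30 kg weight gap dwarfs
# the 10 cm height gap simply because of the units.
d_raw = (np.linalg.norm(X[0] - X[1]), np.linalg.norm(X[0] - X[2]))

# Standardize each column to mean 0, variance 1, then recompute.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
d_std = (np.linalg.norm(Z[0] - Z[1]), np.linalg.norm(Z[0] - Z[2]))

print(d_raw)   # person 2 looks ~3x farther than person 1
print(d_std)   # after scaling, the two are roughly equidistant
```

Whether to scale is a modeling choice, not a rule: scale when features are in incomparable units, keep raw values when magnitudes are themselves meaningful.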
K-means
different from classification because there is no output variable (e.g., knowing height and weight but not gender). K is the number of clusters and must be fixed in advance. the goal of K-means is to minimize the dissimilarity (equivalently, maximize the similarity) within each cluster: minimize the total within-cluster variation, sum over l of W(C_l). follow the process: 1) randomly assign each sample to one of the K clusters 2a) compute the centroid of each cluster 2b) reassign each sample to its nearest centroid [iterate 2a-2b until the centroids don't move].
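The steps above can be sketched as Lloyd's algorithm in plain numpy. This is an illustrative sketch, not a production implementation; the empty-cluster guard is my addition, not part of the card's recipe.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each sample to one of the K clusters.
    labels = rng.integers(0, K, size=len(X))
    for _ in range(n_iter):
        # Step 2a: compute the centroid of each cluster
        # (re-seed from a random point if a cluster went empty -- my addition).
        centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k)
            else X[rng.integers(len(X))]
            for k in range(K)
        ])
        # Step 2b: reassign each sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments stable -> converged
            break
        labels = new_labels
    return labels, centroids
```

Convergence here means the assignments (and hence centroids) stop changing; as the next card notes, that fixed point is only a local minimum of the objective.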
Properties of K-means algorithm (4)
1) will always converge to a local minimum of W(C_L), not necessarily a global minimum
2) each step minimizes the distance to the centroid
3) each initialization will yield a different local min
4) in practice, we run k-means several times with different random initializations and keep the run that minimizes the objective function
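A sketch of property 4, assuming scikit-learn is available: run K-means from several initializations and keep the run with the smallest objective (scikit-learn exposes it as `inertia_`).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs (synthetic data for illustration).
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 8])

# Ten single-initialization runs; keep the one with the lowest
# within-cluster sum of squares (the objective, sum of W(C_l)).
best = min(
    (KMeans(n_clusters=2, n_init=1, random_state=s).fit(X) for s in range(10)),
    key=lambda km: km.inertia_,
)
print(best.inertia_)
```

Equivalently, `KMeans(n_clusters=2, n_init=10)` performs the restarts internally and returns the best run.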
Hierarchical clustering
agglomerative, bottom-up clustering. we do not have to commit to a choice of K clusters in advance. the output of hierarchical clustering is a dendrogram; the y-axis tells you the dissimilarity at which two groups were merged. hierarchical clustering is not always appropriate (e.g., gender and nationality are not nested, so forcing a hierarchy on them is artificial).
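A sketch using SciPy's hierarchical-clustering tools (assumes scipy is installed): `linkage` builds the merge tree that a dendrogram plots, and cutting the tree at a chosen height recovers flat clusters without having fixed K in advance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two synthetic blobs for illustration.
X = np.vstack([rng.normal(size=(10, 2)), rng.normal(size=(10, 2)) + 6])

# Z is the (n-1) x 4 merge history; column 2 holds the merge heights
# shown on the dendrogram's y-axis. scipy.cluster.hierarchy.dendrogram(Z)
# would draw the tree.
Z = linkage(X, method="complete")

# Cut the tree at height 3.0 to get flat cluster labels.
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)
```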
4 types of linkage used in hierarchical clustering (and any potential downfalls of each type)
1) complete linkage - maximal intercluster dissimilarity
2) single linkage - minimal intercluster dissimilarity (suffers from chaining phenomenon)
3) average linkage - mean intercluster dissimilarity
4) centroid linkage - dissimilarity between centroid for cluster A and cluster B (can lead to inversions)
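The four linkage rules above can be compared directly in SciPy (assumes scipy). One way to see the inversion problem with centroid linkage: the merge heights recorded in the linkage matrix can decrease, whereas complete, single, and average linkage always produce monotone merge heights.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 2))  # synthetic data for illustration

for method in ["complete", "single", "average", "centroid"]:
    Z = linkage(X, method=method)
    # An inversion is a merge at a *lower* height than an earlier merge;
    # only centroid linkage can produce one here.
    heights = Z[:, 2]
    monotone = bool(np.all(np.diff(heights) >= 0))
    print(method, "monotone merge heights:", monotone)
```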
Choice of dissimilarity measure
the default is Euclidean distance. correlation-based distance would group customers that purchase similar sets of items (similar profiles), not just similar purchase magnitudes. Mahalanobis distance rescales by the inverse covariance matrix of the data, so a perturbation spread across many correlated variables and a perturbation confined to only a few variables are weighted the same, removing the effect of correlation between features.
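The measures can be compared directly with `scipy.spatial.distance` (assumes scipy; the toy vectors are made up): two purchase profiles with the same shape but different magnitude are far apart in Euclidean distance yet nearly identical under correlation distance.

```python
import numpy as np
from scipy.spatial.distance import euclidean, correlation, mahalanobis

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])  # same "shape" as x, twice the magnitude

print(euclidean(x, y))    # large: magnitudes differ
print(correlation(x, y))  # ~0: the profiles are perfectly correlated

# Mahalanobis needs an inverse covariance matrix; here it is estimated
# from made-up background data purely for illustration.
bg = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(bg, rowvar=False))
print(mahalanobis(x, y, VI))
```

Note that SciPy's `correlation` is the correlation *distance*, 1 minus the Pearson correlation, so 0 means perfectly correlated.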