Clustering - Hierarchical Flashcards
Hierarchical Clustering - steps
Unsupervised learning
Agglomerative - Bottom-up approach.
First: Select a dissimilarity measure and a linkage.
Steps:
1. Treat each of the n observations as its own cluster
2. For k = n, n-1, … , 2
a. Calculate the inter-cluster dissimilarity between all pairs of the k clusters
b. The two clusters with the lowest inter-cluster dissimilarity are fused; the dissimilarity indicates the height in the dendrogram at which the two clusters join.
The number of clusters is chosen after constructing and inspecting the dendrogram, by cutting it at a suitable height.
The algorithm only needs to be run once.
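A minimal sketch of these steps in Python, assuming SciPy's hierarchical clustering routines; the data matrix, the Euclidean dissimilarity, complete linkage, and the cut height are illustrative choices, not part of the card.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))   # n = 12 made-up observations with 2 variables

# Steps 1-2: start with n singleton clusters and repeatedly fuse the two
# clusters with the lowest inter-cluster dissimilarity (complete linkage here).
Z = linkage(X, method="complete", metric="euclidean")

# Each row of Z records one fusion; column 2 holds the dendrogram height,
# i.e. the inter-cluster dissimilarity at which that fusion occurred.
print(Z[:, 2])

# Choosing the number of clusters = cutting the dendrogram at a height.
labels = fcluster(Z, t=2.5, criterion="distance")   # cut height 2.5 is arbitrary
print(labels)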
Hierarchical Clustering - notes
An inter-cluster dissimilarity measures dissimilarity between two clusters.
A dendrogram is a visual representation of the clustering result. The bottom indicates where all the observations are in their own cluster; the top indicates where all the observations are in one cluster.
Dissimilarity measure: quantifies how different two observations in the data set are
-Euclidean distance captures the difference in the variable values
-Correlation-based distance captures the difference in the pattern from variable to variable.
Linkage: Determines how the inter-cluster dissimilarity is calculated, i.e. how the pairwise dissimilarities between observations in the two clusters are aggregated.
-Complete - largest pairwise dissimilarity
-Single - smallest pairwise dissimilarity
-Average - average pairwise dissimilarity
-Centroid - dissimilarity between centroids
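A short sketch contrasting the two dissimilarity measures and trying each linkage, again assuming SciPy; the toy observation matrix is invented so that the Euclidean and correlation-based rankings visibly disagree.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],    # same pattern as row 0, different values
              [3.0, 2.0, 1.0]])   # reversed pattern

print(pdist(X, metric="euclidean"))    # value-based: rows 0 and 1 are far apart
print(pdist(X, metric="correlation"))  # pattern-based (1 - correlation): rows 0 and 1 are ~0 apart

# Any of these linkages can aggregate the pairwise dissimilarities; centroid
# linkage in SciPy assumes Euclidean distance on the raw observations.
for method in ["complete", "single", "average", "centroid"]:
    Z = linkage(X, method=method)
    print(method, Z[:, 2])             # fusion heights under each linkage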
Dendrogram notes
The height values correspond to the values of the linkage function at the points where two clusters are combined in the hierarchical clustering algorithm. When the height difference between adjacent values is small on a relative basis, the observations in the corresponding clusters have similar characteristics as determined by the linkage function and dissimilarity measure, and it makes sense to combine the clusters. On the other hand, when the height difference between adjacent values is large, the observations in the corresponding clusters have materially different characteristics, so combining the clusters would result in a loss of information.
The clustering algorithm starts out with n clusters and fuses them together in an iterative process based on which observations are most similar. The complete linkage method uses the maximum inter-cluster dissimilarity, while the single linkage method uses the minimum inter-cluster dissimilarity. As such, single linkage tends to fuse observations one at a time, resulting in less balanced clusters.
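A hedged sketch of drawing the dendrograms under complete and single linkage and cutting one of them, assuming SciPy and matplotlib; the simulated data and the three-cluster cut are illustrative only.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))                     # made-up observations

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, method in zip(axes, ["complete", "single"]):
    Z = linkage(X, method=method)
    dendrogram(Z, ax=ax)     # bar heights are the linkage values at each fusion
    ax.set_title(method + " linkage")
plt.show()

# Single linkage tends to chain observations one at a time, so its tree is
# usually less balanced; cutting either tree yields a chosen number of clusters.
labels = fcluster(linkage(X, method="single"), t=3, criterion="maxclust")
print(labels)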