Cluster Analysis Flashcards
Cluster analysis is based on and involves
Cluster analysis is based on the concept of similarity. Groups are formed by pairing individual cases within a dataset according to how similar they are on a series of two or more scales or measures.
Cluster analysis techniques
Allow the researcher to examine how cases in the dataset are related to each other across a range of variables
Cluster analysis - distance
Observations are related and paired to each other on the basis of distances, which are defined by the differences between the scores of one observation and the corresponding scores of another. Distance (D) is based on the notion of an n x v dimensional space, written D = (n, v), where n = number of observations and v = number of variables.
Euclidean distance
Square root of the sum of the squared differences between the corresponding scores of two observations
Manhattan distance
Sum of the absolute differences between the corresponding scores of two observations
Pearson distance
Square root of the sum of the squared differences between the scores of two observations, with each squared difference divided by the variance of that variable
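The three distance measures above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the example scores are hypothetical, and the Pearson version assumes the per-variable variances are supplied by the caller.

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared differences between paired scores.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of the absolute differences between paired scores.
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b, variances):
    # Like Euclidean distance, but each squared difference is divided
    # by the variance of that variable (variances are an assumed input).
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))

# Hypothetical scores for two observations on two variables.
obs_a = [1.0, 2.0]
obs_b = [4.0, 6.0]

print(euclidean(obs_a, obs_b))             # 5.0
print(manhattan(obs_a, obs_b))             # 7.0
print(pearson(obs_a, obs_b, [9.0, 16.0]))  # square root of 2
```

Dividing by the variance puts variables measured on different scales on a comparable footing, which is why the Pearson result here is much smaller than the Euclidean one.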
Pairing process
Depending on the distance measure and linkage method adopted, a process of agglomeration takes place: observations are added to clusters, and clusters are joined, until just one cluster remains that contains all of the cases in the dataset.
Single linkage
This is based on the minimum distance between an observation in one cluster and that of another.
Average linkage
This method uses the average of the distances between every observation in one cluster and every observation in the other, rather than the distance between a single pair of observations.
Centroid linkage
This uses another averaging technique that links clusters according to the distance between the cluster means (centroids).
Complete linkage
This offers a means by which to examine the maximum possible distance between an observation in one cluster and that of another. The clusters tend to be of a relatively similar size and uniformity. This can mean the outliers are more significant, pulling the maximum limits of any given cluster and skewing the result.
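To compare the single, average, centroid and complete linkage rules side by side, here is a small sketch. It assumes Euclidean distance, takes average linkage in its usual sense (the mean of all pairwise distances between the two clusters), and uses two made-up clusters; all function names are hypothetical.

```python
import math
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2):
    # Minimum distance between any observation in c1 and any in c2.
    return min(euclidean(a, b) for a, b in product(c1, c2))

def complete_linkage(c1, c2):
    # Maximum distance between any observation in c1 and any in c2.
    return max(euclidean(a, b) for a, b in product(c1, c2))

def average_linkage(c1, c2):
    # Mean of all pairwise distances between the two clusters.
    dists = [euclidean(a, b) for a, b in product(c1, c2)]
    return sum(dists) / len(dists)

def centroid(cluster):
    # Mean of each variable across the observations in the cluster.
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def centroid_linkage(c1, c2):
    # Distance between the two cluster means.
    return euclidean(centroid(c1), centroid(c2))

# Two hypothetical clusters of two observations each.
cluster_1 = [(0.0, 0.0), (0.0, 2.0)]
cluster_2 = [(3.0, 0.0), (3.0, 2.0)]

print(single_linkage(cluster_1, cluster_2))    # 3.0
print(complete_linkage(cluster_1, cluster_2))  # square root of 13
print(average_linkage(cluster_1, cluster_2))   # between the two values above
print(centroid_linkage(cluster_1, cluster_2))  # 3.0
```

Single linkage always gives the smallest of these values and complete linkage the largest, so the choice of rule changes which clusters look closest.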
Hierarchical clustering
Each case starts as a separate cluster, and clusters are combined sequentially until one cluster is left. Once a case has been joined to a cluster it does not move.
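The agglomeration process can be sketched as a short loop. This is an illustrative toy that assumes Euclidean distance and single linkage; the function names and data are hypothetical, and a real analysis would normally rely on a statistics package.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2, points):
    # Clusters are lists of case indices into `points`.
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points):
    # Every case starts as its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]], points),
        )
        dist = single_linkage(clusters[a], clusters[b], points)
        merges.append((clusters[a], clusters[b], dist))
        # Joined cases stay together; they are never reassigned later.
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Four hypothetical cases measured on one variable.
cases = [(0.0,), (1.0,), (5.0,), (7.0,)]
for left, right, dist in agglomerate(cases):
    print(left, right, dist)
# [0] [1] 1.0
# [2] [3] 2.0
# [0, 1] [2, 3] 4.0
```

The list of merges and their distances is the information a dendrogram displays.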
Median linkage
Ensures that the median, and not the mean, distance between two clusters provides the distance measure.
Method to evaluate the distances between clusters
The most popular approach is Ward’s method, which uses an analysis of variance (ANOVA) approach to evaluate the distances between clusters.
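One common way to express Ward’s criterion is as the increase in the within-cluster sum of squares that a merge would cause; the pair of clusters with the smallest increase is joined. The sketch below assumes that formulation, and its function names and data are hypothetical.

```python
def centroid(cluster):
    # Mean of each variable across the observations in the cluster.
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def within_ss(cluster):
    # Sum of squared deviations of each observation from the cluster mean.
    centre = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, centre)) for p in cluster)

def ward_cost(c1, c2):
    # Increase in within-cluster sum of squares if c1 and c2 were merged.
    return within_ss(c1 + c2) - within_ss(c1) - within_ss(c2)

# Two hypothetical clusters measured on one variable.
cluster_1 = [(0.0,), (2.0,)]
cluster_2 = [(4.0,), (6.0,)]

print(within_ss(cluster_1))             # 2.0
print(within_ss(cluster_2))             # 2.0
print(ward_cost(cluster_1, cluster_2))  # 16.0
```

The merged cluster has a sum of squares of 20.0, so joining these two clusters adds 16.0 to the total within-cluster variation.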
Non hierarchical clustering
The number of clusters is specified in advance, and cases can move between clusters right up until the end of the process.
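A familiar example of this kind of procedure is k-means, which the card itself does not name; it is used here only as an assumed illustration. In this sketch the starting centres are supplied by hand so that the result is reproducible, and all names and data are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, start_centres, max_iter=100):
    # The number of clusters is fixed in advance by the starting centres.
    centres = [list(c) for c in start_centres]
    labels = None
    for _ in range(max_iter):
        # Assign every case to its nearest centre; cases may switch cluster.
        new_labels = [
            min(range(len(centres)), key=lambda k: sq_dist(p, centres[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # No case moved, so the process has settled.
        labels = new_labels
        # Recompute each centre as the mean of its current members.
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centres[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres

# Four hypothetical cases on one variable, two clusters requested.
cases = [(0.0,), (1.0,), (5.0,), (7.0,)]
labels, centres = kmeans(cases, start_centres=[(0.0,), (1.0,)])
print(labels)   # [0, 0, 1, 1]
print(centres)  # [[0.5], [6.0]]
```

In this run the second case starts in cluster 1 and moves to cluster 0 on the next pass, which is exactly the behaviour that hierarchical clustering does not allow.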