Cluster Analysis Flashcards
Cluster analysis is based on and involves
Cluster analysis is based on the concept of similarity. Groups are formed by pairing individual cases within a dataset according to how similar they are on a series of two or more scales or measures.
Cluster analysis techniques
Allow the researcher to examine how cases in the dataset are related to each other across a range of variables
Cluster analysis - distance
Observations are related and paired to each other on the basis of distances, which are defined by the differences between the scores of one observation and the corresponding scores of another. Distance (D) is based on the notion of n x v dimensional space, where D = (n, v): D is the dimensional space, n is the number of observations and v is the number of variables.
Euclidean distance
Square root of the sum of the squared differences between the corresponding scores of two observations
Manhattan distance
Sum of the absolute differences between the corresponding scores of two observations
Pearson distance
Square root of the sum of the squared differences between the corresponding scores of two observations, with each squared difference divided by the variance of that variable
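The three distance measures above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and `pearson_distance` assumes the caller supplies one variance per variable (for example, computed across the whole dataset).

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared differences between paired scores."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences between paired scores."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b, variances):
    """Euclidean-style distance with each squared difference divided by
    the variance of that variable (variances are assumed to be given)."""
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))

p, q = [0.0, 0.0], [3.0, 4.0]
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0
print(pearson_distance(p, q, [9.0, 16.0]))
```

Dividing by the variance puts variables measured on different scales on a comparable footing, which is why the Pearson distance can rank pairs differently from the plain Euclidean distance.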
Pairing process
Depending on the distance measure and linkage method adopted, a process of agglomeration takes place: observations are added to clusters step by step until a single cluster has been formulated that contains all the cases in the dataset.
Single linkage
This is based on the minimum distance between an observation in one cluster and an observation in another.
Average linkage
This method is based on the average of the distances between every pair of observations in different clusters, rather than only the closest or the furthest pair.
Centroid linkage
This uses another averaging technique that links clusters according to the distance between the cluster means (centroids).
Complete linkage
This is based on the maximum distance between an observation in one cluster and an observation in another. The clusters tend to be of relatively similar size and uniformity. This can mean that outliers are more influential, pulling the maximum limits of any given cluster and skewing the result.
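The single, average, centroid and complete linkage rules can be compared side by side in a short Python sketch. The function names and the two toy clusters are illustrative assumptions; average linkage is taken here as the mean of all between-cluster pairwise distances, and Euclidean distance is assumed throughout.

```python
import math
from itertools import product

def dist(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    """Mean of each variable across the cases in a cluster."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def single_linkage(c1, c2):
    """Minimum distance over all between-cluster pairs."""
    return min(dist(a, b) for a, b in product(c1, c2))

def complete_linkage(c1, c2):
    """Maximum distance over all between-cluster pairs."""
    return max(dist(a, b) for a, b in product(c1, c2))

def average_linkage(c1, c2):
    """Mean distance over all between-cluster pairs."""
    pairs = [dist(a, b) for a, b in product(c1, c2)]
    return sum(pairs) / len(pairs)

def centroid_linkage(c1, c2):
    """Distance between the two cluster means."""
    return dist(centroid(c1), centroid(c2))

left = [[0.0, 0.0], [0.0, 2.0]]
right = [[3.0, 0.0], [5.0, 0.0]]
for rule in (single_linkage, average_linkage, centroid_linkage, complete_linkage):
    print(rule.__name__, rule(left, right))
```

For any pair of clusters the single linkage value is the smallest and the complete linkage value the largest, with average linkage in between.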
Hierarchical clustering
Each case starts as a separate cluster, and clusters are combined sequentially until one cluster is left. Once a case has been joined to a cluster, it does not move.
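The sequential merging described above can be sketched as a small loop. This is a minimal illustration under stated assumptions: it uses single linkage with Euclidean distance, the function name `agglomerate` and the shape of the returned merge history are hypothetical, and the brute-force search is only suitable for tiny datasets.

```python
import math

def dist(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points):
    """Merge the two closest clusters (single linkage) until one remains.

    Returns the merge history as (members_a, members_b, distance) tuples,
    where members are indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]  # each case starts alone
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]  # joined cases never split again
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

points = [[0.0], [1.0], [5.0], [7.0]]
for a, b, d in agglomerate(points):
    print(a, b, d)
# [0] [1] 1.0
# [2] [3] 2.0
# [0, 1] [2, 3] 4.0
```

With n cases the loop always performs n - 1 merges, which is the history a dendrogram draws.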
Median linkage
Ensures that the median, and not the mean, distance between two clusters provides the distance measure.
Method to evaluate the distances between clusters
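The most popular approach is Ward’s method, which uses an analysis of variance (ANOVA) approach to evaluate the distances: at each step it joins the two clusters whose merger produces the smallest increase in the within-cluster sum of squares.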
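One way to picture the variance-based criterion in Ward’s method is as the increase in within-cluster sum of squares caused by a merger. The sketch below is a hedged illustration of that idea; the function names and toy clusters are assumptions made for the example.

```python
def centroid(cluster):
    """Mean of each variable across the cases in a cluster."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def within_ss(cluster):
    """Sum of squared deviations of each case from the cluster centroid."""
    c = centroid(cluster)
    return sum((x - m) ** 2 for row in cluster for x, m in zip(row, c))

def ward_cost(c1, c2):
    """Increase in within-cluster sum of squares if c1 and c2 are merged."""
    return within_ss(c1 + c2) - within_ss(c1) - within_ss(c2)

a = [[0.0], [2.0]]
b = [[3.0], [5.0]]
c = [[10.0], [12.0]]
print(ward_cost(a, b))  # 9.0
print(ward_cost(b, c))  # 49.0
print(ward_cost(a, c))  # 100.0
```

Here clusters a and b would be joined first, because merging them adds the least within-cluster variation.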
Non hierarchical clustering
The number of clusters is specified in advance, and cases can move between clusters right up until the end of the process.
Most popular non hierarchical clustering approach
k-means clustering - produces clusters with the greatest possible distinction between clusters.
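A minimal k-means sketch shows how cases can change cluster between passes. It assumes Euclidean distance and starting centres supplied by the caller; the function name `kmeans` and its parameters are illustrative, and real software typically chooses the starting centres itself.

```python
import math

def dist(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, centres, max_iter=100):
    """Alternate between assigning cases and updating centres until stable."""
    centres = [list(c) for c in centres]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each case joins its nearest centre.
        labels = [min(range(len(centres)), key=lambda k: dist(p, centres[k]))
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centres.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centres.append(centres[k])  # keep the centre of an empty cluster
        if new_centres == centres:
            break  # no centre moved, so no case will change cluster
        centres = new_centres
    return labels, centres

points = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]]
labels, centres = kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
print(labels)   # [0, 0, 1, 1]
print(centres)  # [[1.0, 1.5], [8.5, 8.0]]
```

Unlike the hierarchical approach, a case assigned to one cluster in an early pass may be reassigned later if a centre moves closer to it.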
Dendrograms
Diagrammatic representation of the pairing process indicating how many clusters existed at any particular part of the process. Every observation is represented as an individual, therefore clusters and individuals not found to be identical at the first stage are represented by vertical lines.

Cutting of Dendrograms
Cutting is a subjective and somewhat difficult process, as the number of clusters one wants to retain for further analysis depends on the research objectives. The decision may also depend on whether the work is exploratory or confirmatory. Where there is a confirmatory element in the research, the dendrogram might be cut according to the relevant number of clusters sought.
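Cutting to a chosen number of clusters can be sketched by replaying the merges and stopping early. The merge history format below, (members_a, members_b, height) tuples in the order the merges happened, is a hypothetical one chosen for this example, as is the function name `cut`.

```python
def cut(history, n_cases, k):
    """Replay merges in order and stop once only k clusters remain."""
    clusters = [[i] for i in range(n_cases)]
    for a, b, _height in history:
        if len(clusters) <= k:
            break
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c != a and c != b] + [a + b]
    return sorted(sorted(c) for c in clusters)

# Hypothetical merge history for four cases (indices 0 to 3).
history = [([0], [1], 1.0), ([2], [3], 2.0), ([0, 1], [2, 3], 4.0)]
print(cut(history, 4, 2))  # [[0, 1], [2, 3]]
print(cut(history, 4, 3))  # [[0, 1], [2], [3]]
```

The code only applies a choice of k; deciding which k is appropriate remains the subjective step described above.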
Cautions/Limitations
The number of clusters to retain is subjective; no missing data can be permitted; outliers or skewed data can cause problems for clustering; failing to find groups doesn’t mean they don’t exist; and the presence of groups doesn’t make them meaningful.