Cluster Analysis Flashcards
Which type of ‘multivariate analysis’ does cluster analysis come under?
Analysis of INTERDEPENDENCE (reducing large amounts of data/variables into more manageable amounts to enable a more effective analysis)
What is cluster analysis and why is it useful?
This involves turning CASES into CLUSTERS (defo not variables!!)
Individual cases are simplified into fewer groups called ‘clusters’ based on similarities between them. The characteristics of a cluster can then be explored as well as the relationships between them.
A way of grouping large amounts of data to make it more manageable to explore e.g. behavioural data to help companies understand different customer needs
How is similarity/dissimilarity quantified? How does this relate to the aim of cluster analysis?
Through looking at statistical distance.
The aim of cluster analysis is to maximise similarity within a cluster (small distances between its cases) and to maximise dissimilarity between different clusters (large distances between them).
Give three key ways of measuring distance between pairs of cases. Give benefits/uses
EUCLIDEAN = straight line distance, most commonly used (square root of the sum of squared differences between each score for 2 observations)
MANHATTAN = ‘city block’ method; sum of absolute differences; along and up. REDUCES IMPACT OF OUTLIERS.
PEARSON = standardises distance where measures operate over different magnitude ranges; use it for cluster analysis on data that SPAN LARGE DIFFERENCES IN MAGNITUDE
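The three measures above can be sketched in a few lines of plain Python. This is only an illustration: the function names and example cases are made up, and the Pearson version assumes the common "1 minus the correlation" form of the distance, which the card itself does not spell out.

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared differences between the two cases.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences ("along and up").
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b):
    # ASSUMPTION: distance defined as 1 - r, where r is the Pearson
    # correlation between the two cases' score profiles.
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1 - cov / math.sqrt(var_a * var_b)

# Hypothetical cases scored on two variables (a 3-4-5 triangle).
case_1 = (0, 0)
case_2 = (3, 4)
print(euclidean(case_1, case_2))  # 5.0
print(manhattan(case_1, case_2))  # 7

# Hypothetical profiles with the same shape but very different magnitudes.
small = [1, 2, 3, 4]
large = [100, 200, 300, 400]
print(euclidean(small, large))         # large, driven by magnitude alone
print(pearson_distance(small, large))  # approximately 0: same shape
```

The last two lines show why a correlation-based distance can help when variables or cases operate over very different ranges: the raw Euclidean distance is dominated by scale, while the Pearson-style distance only reflects the pattern of scores.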
How would you graphically represent Euclidean and Manhattan distance?
Imagine a right-angled triangle on a graph, with the two cases at either end of the hypotenuse
Euclidean = diagonal/hypotenuse
Manhattan = the opposite and adjacent sides added together
Define and imagine drawing diagrams for the different linkage procedures - single, complete, average and centroid.
Why are linkage procedures important?
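SINGLE = aka Nearest Neighbour; shortest distance between two members in the two clusters
COMPLETE = aka Furthest Neighbour; longest distance between two members in the two clusters
AVERAGE = mean of the distances between every pair of members, one from each cluster
CENTROID = distance between the centre points (means) of the two clusters
The choice determines how distances between clusters are defined, so it affects which clusters get merged and how sensitive the solution is to outliers or skewed data.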
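As a rough sketch of how the four linkage procedures differ, the snippet below computes each one for the same pair of clusters. The clusters and function names are invented for illustration, and Euclidean distance is assumed as the underlying measure.

```python
import math
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2):
    # Nearest neighbour: shortest distance between any pair of members.
    return min(euclidean(a, b) for a, b in product(c1, c2))

def complete_linkage(c1, c2):
    # Furthest neighbour: longest distance between any pair of members.
    return max(euclidean(a, b) for a, b in product(c1, c2))

def average_linkage(c1, c2):
    # Mean of all pairwise distances between the two clusters.
    distances = [euclidean(a, b) for a, b in product(c1, c2)]
    return sum(distances) / len(distances)

def centroid(cluster):
    # Mean of each coordinate across the cluster's members.
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_linkage(c1, c2):
    # Distance between the two cluster centres.
    return euclidean(centroid(c1), centroid(c2))

# Hypothetical clusters, two cases each.
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(3, 0), (5, 0)]

for name, fn in [("single", single_linkage), ("complete", complete_linkage),
                 ("average", average_linkage), ("centroid", centroid_linkage)]:
    print(name, round(fn(cluster_a, cluster_b), 3))
```

Single linkage always gives the smallest value and complete linkage the largest, with average linkage in between, which is why the same data can produce different merge orders under different procedures.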
What are the two different approaches to clustering and what methods do they use?
HIERARCHICAL = uses Ward’s Method; each case starts as its own cluster, clusters are sequentially combined until one cluster is left, and once a case has joined a cluster it stays there (cases are fixed).
NON-HIERARCHICAL = uses k-means clustering; number of clusters specified in advance, cases can move between clusters all the way through.
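To make the "cases can move clusters" point concrete, here is a minimal k-means sketch in plain Python. It is not how SPSS implements it: the data, the deliberately poor starting centres, and the function names are all assumptions chosen so that one case visibly switches cluster.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(cases, initial_centroids, max_iter=100):
    # The number of clusters is fixed in advance by the starting centres.
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Assign every case to its nearest centre.
        new_labels = [
            min(range(len(centroids)), key=lambda j: euclidean(case, centroids[j]))
            for case in cases
        ]
        if new_labels == labels:
            break  # no case moved, so the solution is stable
        labels = new_labels
        # Recompute each centre as the mean of its current members.
        for j in range(len(centroids)):
            members = [c for c, lab in zip(cases, labels) if lab == j]
            if members:
                centroids[j] = mean_point(members)
    return labels, centroids

# Hypothetical cases: three low scorers, three high scorers.
cases = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9.5)]
# Poor starting centres on purpose: both sit in the low-scoring group.
start = [(1, 1), (1.5, 2)]

labels, centres = k_means(cases, start)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

With these starting centres the case (1.5, 2) is first grouped with the high scorers and then moves to the low-scoring cluster once the centres are recalculated, which is exactly what a hierarchical method cannot do.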
Outline the processes used to decide on the number of clusters.
DENDROGRAMS = lines are subjectively “cut” so that the clusters left are dissimilar enough from each other. The longest lines have the largest distances and are likely to connect clusters that are too dissimilar to be in the same cluster, so we cut the line to give two separate, dissimilar clusters.
AGGLOMERATION SCHEDULE = look at the ‘coefficients’ column in the SPSS output table and look for large jumps in values between stages
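The "look for large jumps" step can be sketched as below. The coefficients are made-up numbers, not real SPSS output, and the rule of stopping just before the biggest jump is one common heuristic rather than a fixed procedure; the final choice stays subjective.

```python
def largest_jump(coefficients):
    # Difference between each stage's coefficient and the next one.
    jumps = [later - earlier
             for earlier, later in zip(coefficients, coefficients[1:])]
    index = max(range(len(jumps)), key=lambda i: jumps[i])
    return index, jumps[index]

def suggested_clusters(coefficients):
    # With n cases there are n - 1 merge stages, one coefficient each.
    n_cases = len(coefficients) + 1
    index, _ = largest_jump(coefficients)
    last_good_stage = index + 1  # stages are numbered from 1
    # ASSUMPTION: stop merging just before the biggest jump.
    return n_cases - last_good_stage

# Hypothetical 'coefficients' column for 6 cases (5 stages).
coefficients = [1.2, 1.5, 2.1, 2.6, 9.8]

print(largest_jump(coefficients)[0] + 1)  # 4: the jump follows stage 4
print(suggested_clusters(coefficients))   # 2
```

Here the coefficient rises sharply between stages 4 and 5, suggesting the final merge forces together two clusters that are too dissimilar, so a two-cluster solution would be retained.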
Outline 4 limitations of the cluster analysis process.
- Subjectivity in retaining clusters
- There can be no missing data at all as it relies on calculating distances for all cases
- Outliers or skewed data can produce problems
- The absence of groups does NOT mean that they don’t exist. The presence of groups doesn’t make them meaningful.