W8: Cluster Analysis Flashcards
Cluster analysis can be used to classify
individuals and separate them for further study
In cluster analysis we find groups of (2)
similar individuals based on their covariate information
These groups are known as clusters
Aim to extract a small number of clusters of individuals who share similar characteristics and who have
different characteristics from those in other clusters
How can we measure the degree of similarity between individuals’ scores across a number of variables?
Using 2 measures: similarity coefficients and dissimilarity coefficients
The correlation coefficient, r, is a measure of
similarity between 2 variables
What does Pearson’s correlation coefficient tell us?
Whether, as 1 variable changes, the other changes by a similar amount
We could use the Pearson’s correlation coefficient, r, to work out the correlation between 2
individuals
However, although the correlation tells us whether the patterns of responses of 2 people are similar
It does not tell us anything about the
distance between 2 individual profiles
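A small illustration of this point, using made-up scores (the two profiles and the helper name `pearson_r` are assumptions for the example, not data from the lecture). The two individuals have exactly the same pattern of responses, so r is 1, yet their profiles are far apart:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores of two individuals on four variables.
person_a = [1.0, 2.0, 3.0, 4.0]
person_b = [11.0, 12.0, 13.0, 14.0]  # same pattern, shifted up by 10

r = pearson_r(person_a, person_b)         # similarity of the response pattern
distance = math.dist(person_a, person_b)  # Euclidean distance between profiles

print(f"r = {r:.2f}, Euclidean distance = {distance:.2f}")
```

Here the correlation says the two people are perfectly similar, while the distance shows that their actual scores differ by 10 on every variable.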
An alternative measure to Pearson’s correlation coefficient, r, is
Euclidean distance
What is Euclidean distance?
Geometric distance between 2 individuals
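The Euclidean distance between individuals i and j, where x_{ik} is the score of individual i on variable k and n is the number of variables:
$$d_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$$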
With Euclidean distance, the smaller the distance
the more similar the individuals
Euclidean distances are heavily affected by
variables with large values or a large spread (variance)
So if cases are being compared across variables that have different variances, then
Euclidean distances will be inaccurate
If Euclidean distances would be inaccurate because the variables have different variances, what can we do? (2)
We may standardise the scores by subtracting the mean of each variable and dividing by its SD
(value - mean)/SD
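A minimal sketch of this standardisation, using invented values. The variable names and numbers are hypothetical, and the sample SD (n - 1 denominator) is an assumption; check which SD your course uses:

```python
import statistics

def standardise(values):
    """Return z-scores: (value - mean) / SD for each value."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1); an assumption
    return [(v - mean) / sd for v in values]

# Hypothetical variables measured on very different scales.
salary = [20000.0, 30000.0, 40000.0]
rank = [1.0, 2.0, 3.0]

z_salary = standardise(salary)
z_rank = standardise(rank)

print(z_salary)
print(z_rank)
```

After standardising, both variables have mean 0 and SD 1, so neither dominates the Euclidean distance just because it is measured in larger units.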
How to calculate SD on your calculator? (8)
Mode
Statistics (2)
1
Input values
OPTN
1 variable (3)
Most methods of grouping individuals based on similarity work in a hierarchical way (2)
- Begin with each individual treated as its own cluster
- At each subsequent stage clusters are merged based on Euclidean distance
For example, when calculating Euclidean distances we first say
We calculate the Euclidean distance between the individuals
In Euclidean distance, n represents
number of variables
What is n here in calculating Euclidean distance?
4
For the first pair of individuals, say A and B, what do we write before applying the formula?
For A and B:
After calculating the Euclidean distances
let's say A and B = 3.26
A and C = 2.13
B and C = 1.23
Ending sentence:
So we would cluster B and C first, then A and C and finally A and B
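The ordering step can be sketched as below. The scores for A, B and C are invented for illustration (they are not the values behind the distances quoted above), with n = 4 variables each:

```python
import math
from itertools import combinations

# Hypothetical scores for three individuals on n = 4 variables.
scores = {
    "A": [1.0, 2.0, 1.0, 2.0],
    "B": [3.0, 4.0, 3.0, 3.0],
    "C": [3.0, 3.0, 2.0, 3.0],
}

# Euclidean distance for every pair of individuals.
pair_distances = {
    (p, q): math.dist(scores[p], scores[q])
    for p, q in combinations(scores, 2)
}

# Pairs sorted from most similar (smallest distance) to least similar.
ranked_pairs = sorted(pair_distances, key=pair_distances.get)

for pair in ranked_pairs:
    print(pair, round(pair_distances[pair], 2))
```

With these made-up scores the smallest distance is between B and C, so that pair would be clustered first, followed by A and C, then A and B.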
Consider 3 ways of merging clusters at each step of the method (3)
- Nearest neighbour
- Furthest neighbour
- Average linkage
What is nearest neighbour?
add individuals to clusters one at a time based
on the lowest Euclidean distance to any member of the cluster.
What is furthest neighbour?
add individuals to clusters one at a time based
on the lowest value of the largest Euclidean distance to any member of the cluster (so the individual must be close to all members of the cluster).
What is average linkage?
add individuals to clusters one at a time based on
the lowest average Euclidean distance to the cluster.
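The three rules can be compared with a short sketch. The cluster, the two candidate individuals and the function name `linkage_distance` are all hypothetical, chosen so that the rules disagree about who joins the cluster next:

```python
import math

def linkage_distance(point, cluster, method):
    """Distance from one individual to a cluster under a given linkage rule."""
    distances = [math.dist(point, member) for member in cluster]
    if method == "nearest":
        return min(distances)                   # closest member only
    if method == "furthest":
        return max(distances)                   # most distant member
    if method == "average":
        return sum(distances) / len(distances)  # mean over all members
    raise ValueError(f"unknown method: {method}")

# Hypothetical existing cluster and two candidates to add.
cluster = [[0.0, 0.0], [4.0, 0.0]]
candidates = {"p": [1.0, 0.0], "q": [2.0, 2.0]}

# For each rule, pick the candidate with the lowest linkage distance.
chosen = {
    method: min(
        candidates,
        key=lambda name: linkage_distance(candidates[name], cluster, method),
    )
    for method in ("nearest", "furthest", "average")
}

print(chosen)
```

Candidate p is very close to one member but far from the other, so nearest neighbour and average linkage pick p, while furthest neighbour prefers q, which is moderately close to both members.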
What is a dendrogram?
It shows how and when individuals are combined in the clustering algorithm
Interpret this dendrogram that used nearest neighbour clustering (5)
We see clusters forming:
First one is between individuals 1,4,11,7 and 13
Second one is between individuals 10,12,9,15 and 2
Third one is between 6,8,5,3 and 14
The first and second clusters are more closely linked to each other than to the third cluster
Interpret this table:
Measured: trait anxiety, depression, intrusive thoughts, impulsivity
Patients with the same disorder should report a similar pattern across scores
We asked 2 psychologists to agree on a diagnosis of GAD, depression or OCD
We see an exact mapping between the 3 clusters and the diagnosis of the psychologists in spite of clustering algorithm having no knowledge of the diagnoses
Interpret the furthest neighbour (FN) dendrogram compared with the nearest neighbour (NN) dendrogram
We end up with same 3 clusters although individuals are combined in different order
Which of the methods provides a more useful dendrogram? Justify
your answer (2)
The furthest neighbour clustering provides a more useful dendrogram.
It forms a more natural set of clusters than the nearest neighbour
clustering algorithm, which produces lots of clusters of size 1.
Using the dendrogram you selected
(a) At what distance should we cut the dendrogram?
(b) What are the memberships of the resulting clusters? - (5)
Using the dendrogram from the furthest neighbour clustering, we
would cut the dendrogram somewhere between a distance of around
15 and 16.
The resulting cluster memberships would be
Cluster a: Ernie, Carla, Christa, Ernest, Christopher, Beulah, Linette, Marie, Bo.
Cluster b: Tony, Martina, Randolph, Raul, Catalina, Louis, Sunila,
Johnson, Mickey.
Cluster c: Rosalyn, Lawrence
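Cutting a dendrogram at a chosen distance can be sketched as follows: keep merging clusters by furthest neighbour and stop once the next merge would need a distance above the cut. This is a minimal sketch on invented one-variable data; the names P1 to P5, the cut values and the function name are hypothetical, and this is not the staff data set or the software used in the course:

```python
import math

def furthest_neighbour_clusters(points, cut_distance):
    """Agglomerative clustering (furthest neighbour) stopped at cut_distance."""
    clusters = [[name] for name in points]  # start: one cluster per individual

    def linkage(c1, c2):
        # Furthest neighbour: largest distance between any two members.
        return max(math.dist(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        d, i, j = min(
            (linkage(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        if d > cut_distance:
            break  # the next merge lies above the cut, so stop here
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical individuals measured on a single variable.
points = {
    "P1": [0.0], "P2": [1.0], "P3": [10.0], "P4": [11.0], "P5": [30.0],
}

low_cut = furthest_neighbour_clusters(points, cut_distance=5)
high_cut = furthest_neighbour_clusters(points, cut_distance=15)

print(low_cut)
print(high_cut)
```

A lower cut leaves more, smaller clusters; a higher cut merges them into fewer, larger ones, which is why the choice of cut distance determines the cluster memberships.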
We have collected data on the salary, rank (from 1 to 5, where 5 is most
senior), FTE (hours worked, where 1 is full time), number of articles
published and years of experience of 20 university staff.
(c) Discuss the cluster memberships in light of the summary statistics and plots (4)
From the boxplots and classification table we can see that:
Cluster 1 is made up of individuals with a high salary, a large
number of articles and many years of experience, all of whom are
professors. It appears that cluster 1 is the most senior academics.
This corresponds to cluster c above.
Cluster 2 is made up of individuals with a lower salary, fewer articles and less experience than cluster 1. However, most are still professors. Therefore, cluster 2 is made up of slightly less
senior academics.
Cluster 3 is made up of individuals who have relatively low salaries,
few articles and less experience, and who are not yet full professors.
That is, cluster 3 is made up of early career academics.
In SPSS what does this table tell us?
- The cluster membership table tells us which cluster each individual belongs to
In SPSS, how many clusters are there in this dendrogram?
3 clusters
In the R dendrogram, how many clusters are there?
3 clusters
Since variables are on the same scale there is no need to
standardise them
Interpret this dendrogram, which shows different aspects of disgust (FN) - (2)
Three broad clusters appear, which seem to be distinct types of disgust.
We may want to further disaggregate into up to seven clusters.
Interpret this dendrogram, which shows different aspects of disgust (NN) - (2)
Clustering of bread, deer, crisp and foxes, but other clustering is hard to distinguish.
Which of the two clustering methods appears to provide the more interpretable
clusters of the different aspects of disgust?
Furthest Neighbour
Data on 7 different measurements of 41 cities
(c) What do you observe from the NN dendrogram?
The clustering is a little tricky to make out, although Chicago appears different to the other cities.
Now perform furthest neighbour clustering on the data. What conclusions can you
make this time? Are they different to the nearest neighbour algorithm?
This time, two/three large clusters appear, plus a cluster with just Chicago.
Based on the furthest neighbour clustering, find the cluster membership for a 3
cluster solution. Are Wichita and St Louis in the same cluster? Which city is the
“odd one out”?
Wichita is in cluster 1 and St Louis in cluster 2. Chicago is the odd one out.
Interpret this dendrogram - is it similar or different to the previous NN/FN dendrograms?
The results are different again. Chicago is clustered with Philadelphia this time.