W8: Cluster Analysis Flashcards by Gitanjali Sharma

Cluster analysis can be used to classify

individuals and separate them for further study

In cluster analysis we find groups of (2)

similar individuals based on their covariate information
These groups are known as clusters

The aim is to extract a small number of clusters of individuals who share similar characteristics and who have

different characteristics than those in other clusters

How can we measure the degree of similarity between individuals’ scores across a number of variables?

Using 2 measures: similarity coefficients and dissimilarity coefficients

The correlation coefficient, r, is a measure of

similarity between 2 variables

What does Pearson’s correlation coefficient tell us?

Whether, as 1 variable changes, the other changes by a similar amount

We could use the Pearson’s correlation coefficient, r, to work out the correlation between 2

individuals

However, although the correlation tells us whether the pattern of responses between people is similar

It does not tell us anything about the

distance between 2 individual profiles
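
The point above can be illustrated with a minimal sketch. The scores for `person_a` and `person_b` are made up for illustration, and `pearson_r` is a hypothetical helper written here, not something from the course material. The two profiles have the same shape, so r is 1, yet they sit far apart.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length score profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for two individuals on three variables
person_a = [1, 2, 3]
person_b = [11, 12, 13]

r = pearson_r(person_a, person_b)          # same pattern of responses
distance = math.dist(person_a, person_b)   # but the profiles are far apart
print(r, distance)
```

A perfect correlation here says nothing about the gap of 10 points on every variable, which is what the distance picks up.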

An alternative measure to Pearson’s correlation coefficient, r, is

Euclidean distance

What is Euclidean distance?

Geometric distance between 2 individuals

The formula for the Euclidean distance between individuals i and j is given below:
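
The formula image from the original card is not reproduced here. As a stand-in, this is the standard definition of Euclidean distance. The symbols x_ik and x_jk (the scores of individuals i and j on variable k) are notation assumed for this sketch, and n is the number of variables, as stated in a later card.

```latex
d_{ij} = \sqrt{\sum_{k=1}^{n} \left( x_{ik} - x_{jk} \right)^{2}}
```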

With Euclidean distance, the smaller the distance

the more similar the individuals

Euclidean distances are heavily affected by

variables with large values or a large spread (variance)

So if cases are being compared across variables that have different variances, then

Euclidean distances will be inaccurate

When Euclidean distances would be inaccurate because the variables being compared have different variances, we (2)

May standardise the scores by subtracting the mean of each variable and dividing by the SD

(value - mean)/SD
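
A minimal sketch of the standardisation above, using made-up data. The variable names `salary` and `rank` and their values are hypothetical, and the sample SD (dividing by n - 1) is an assumption, since the card does not say which SD is intended.

```python
import math
import statistics

def standardise(values):
    """Return z-scores: (value - mean) / SD, using the sample SD (n - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical data: two variables on very different scales, three individuals
salary = [20000, 30000, 40000]
rank = [1, 2, 3]

z_salary = standardise(salary)
z_rank = standardise(rank)

# Distance between individuals 0 and 2, before and after standardising
raw_distance = math.dist((salary[0], rank[0]), (salary[2], rank[2]))
z_distance = math.dist((z_salary[0], z_rank[0]), (z_salary[2], z_rank[2]))
print(raw_distance, z_distance)
```

Before standardising, the distance is almost entirely the salary difference. Afterwards both variables contribute equally.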

How to calculate SD on your calculator? (8)

Mode
Statistics (2)
1
Input values
OPTN
1 variable (3)

Most methods of grouping individuals based on similarity work in a hierarchical way (2)

Begin with each individual treated as a cluster in itself
At each subsequent stage, clusters are merged based on Euclidean distance
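
The hierarchical procedure can be sketched roughly as follows. This is an illustrative implementation, not the one used by SPSS or R. The function name `agglomerate`, the data in `points`, and the choice of the smallest pairwise distance as the merge rule are all assumptions made for the example.

```python
import math

def agglomerate(points):
    """Merge clusters step by step and return the list of merges performed.

    points: dict mapping a label to a tuple of (ideally standardised) scores.
    Cluster-to-cluster distance is the smallest pairwise Euclidean distance,
    chosen here only for illustration.
    """
    clusters = [frozenset([label]) for label in points]  # one cluster each
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest distance
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] | clusters[j]
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical individuals scored on two variables
points = {"A": (0.0, 0.0), "B": (3.0, 0.0), "C": (4.0, 0.0)}
merges = agglomerate(points)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

Each pass through the loop performs one merge, so the list of merges is the information a dendrogram draws.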

For example, when calculating Euclidean distances we first say

We calculate the Euclidean distance between the individuals

In Euclidean distance, n represents

number of variables

What is n here in calculating Euclidean distance?

For the first pair of individuals, say A and B, before calculating we say in the formula:

For A and B:

At the end of the Euclidean distance calculation, let's say:
A and B = 3.26
A and C = 2.13
B and C = 1.23

Ending sentence:

So we would cluster B and C first, then A and C and finally A and B
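
The ordering in this card follows from sorting the pairs by distance, smallest first. A minimal sketch using the three distances quoted above (the names `distances` and `merge_order` are just for this example):

```python
# Distances quoted in the card above
distances = {("A", "B"): 3.26, ("A", "C"): 2.13, ("B", "C"): 1.23}

# Sort pairs from most similar (smallest distance) to least similar
merge_order = sorted(distances, key=distances.get)
print(merge_order)
```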

Consider 3 ways of merging clusters at each step of the method (3)

Nearest neighbour
Furthest neighbour
Average linkage

What is nearest neighbour?

add individuals to clusters one at a time based on the lowest Euclidean distance to any member of the cluster.

What is furthest neighbour?

add individuals to clusters one at a time based on the lowest Euclidean distance to the furthest member of the cluster, so the individual must be close to all members of the cluster.

What is average linkage?

add individuals to clusters one at a time based on the lowest average Euclidean distance to the cluster.
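
The three rules differ only in how the distance from an individual to an existing cluster is summarised. A rough sketch, with hypothetical function names and made-up data (nearest and furthest neighbour are also commonly called single and complete linkage):

```python
import math

def nearest_neighbour(individual, cluster):
    """Distance to the closest member of the cluster (single linkage)."""
    return min(math.dist(individual, member) for member in cluster)

def furthest_neighbour(individual, cluster):
    """Distance to the most distant member of the cluster (complete linkage)."""
    return max(math.dist(individual, member) for member in cluster)

def average_linkage(individual, cluster):
    """Mean distance to all members of the cluster."""
    return sum(math.dist(individual, member) for member in cluster) / len(cluster)

# Hypothetical cluster of two individuals and one candidate to add
cluster = [(0.0, 0.0), (0.0, 4.0)]
candidate = (3.0, 0.0)

nn = nearest_neighbour(candidate, cluster)
fn = furthest_neighbour(candidate, cluster)
avg = average_linkage(candidate, cluster)
print(nn, fn, avg)
```

Under each rule, the candidate joins whichever cluster gives the lowest value of the chosen summary.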

What is a dendrogram?

It shows how and when individuals are combined in the clustering algorithm

Interpret this dendrogram, which used nearest neighbour clustering (5)

We see three clusters forming:
The first is between individuals 1, 4, 11, 7 and 13
The second is between individuals 10, 12, 9, 15 and 2
The third is between individuals 6, 8, 5, 3 and 14
The first and second clusters are more closely linked to each other than to the third cluster

Interpret this table: We measured trait anxiety, depression, intrusive thoughts and impulsivity. Patients with the same disorder should report a similar pattern of scores. We asked 2 psychologists to agree on a diagnosis of GAD, depression or OCD

We see an exact mapping between the 3 clusters and the diagnoses of the psychologists, in spite of the clustering algorithm having no knowledge of the diagnoses

Interpret the furthest neighbour (FN) dendrogram as compared to the nearest neighbour (NN) one

We end up with the same 3 clusters, although individuals are combined in a different order

Which of the methods provides a more useful dendrogram? Justify your answer (2)

The furthest neighbour clustering provides a more useful dendrogram. It forms a more natural set of clusters than the nearest neighbour clustering algorithm, which produces lots of clusters of size 1.

Using the dendrogram you selected: (a) At what distance should we cut the dendrogram? (b) What are the memberships of the resulting clusters? - (5)

Using the dendrogram from the furthest neighbour clustering, we would cut the dendrogram somewhere between a distance of around 15 and 16. The resulting cluster memberships would be:
Cluster a: Ernie, Carla, Christa, Ernest, Christopher, Beulah, Linette, Marie, Bo
Cluster b: Tony, Martina, Randolph, Raul, Catalina, Louis, Sunila, Johnson, Mickey
Cluster c: Rosalyn, Lawrence

We have collected data on the salary, rank (from 1 to 5, where 5 is most senior), FTE (hours worked, where 1 is full time), number of articles published and years of experience of 20 university staff. (c) Discuss the cluster memberships in light of the summary statistics and plots (4)

From the boxplots and classification table we can see that:
Cluster 1 is made up of individuals with a high salary, a large number of articles and many years of experience, all of whom are professors. It appears that cluster 1 is the most senior academics. This corresponds to cluster c above.
Cluster 2 is made up of individuals with a lower salary, fewer articles and less experience than cluster 1. However, most are still professors. Therefore, cluster 2 is made up of a slightly lower level of senior academics.
Cluster 3 is made up of individuals who have relatively low salaries, few articles and less experience, and who are not yet full professors. That is, cluster 3 is made up of early career academics.

In SPSS, what does this table tell us?

This is the cluster membership table. It tells us which cluster each individual belongs to

In SPSS, how many clusters are there in this dendrogram?

3 clusters

In the R dendrogram, how many clusters are there?

3 clusters

Since variables are on the same scale there is no need to

standardise them

Interpret this dendrogram, which shows different aspects of disgust, using furthest neighbour (FN) clustering - (2)

Three broad clusters appear, which seem to be distinct types of disgust. We may want to further disaggregate into up to seven clusters.

Interpret this dendrogram, which shows different aspects of disgust, using nearest neighbour (NN) clustering - (2)

Clustering of bread, deer, crisp and foxes, but other clustering is hard to distinguish.

Which of the two clustering methods used appears to provide the more interpretable clusters of the different aspects of disgust?

Furthest Neighbour

Data on 7 different measurements of 41 cities. (c) What do you observe from the NN dendrogram?

The clustering is a little tricky to make out, although Chicago appears different to the other cities.

Now perform furthest neighbour clustering on the data. What conclusions can you make this time? Are they different to the nearest neighbour algorithm?

This time, two/three large clusters appear, plus a cluster with just Chicago.

Based on the furthest neighbour clustering, find the cluster membership for a 3 cluster solution. Are Wichita and St Louis in the same cluster? Which city is the “odd one out”?

Wichita is in cluster 1 and St Louis in cluster 2. Chicago is the odd one out.

Interpret this dendrogram - is it similar or different to the previous NN/FN dendrograms?

The results are different again. Chicago is clustered with Philadelphia this time.

(45 cards)