Multi-Dimensional Scaling and Cluster Analysis Flashcards
why are we covering both CA and MDS techniques in one class?
because they are complementary to each other
can merge the two techniques and use them together
what are CA and MDS?
Two types of exploratory techniques
- help us to understand and locate structure and relationships in the data
- groups objects together based on their characteristics
- looks for patterns of information
what's the difference between factor analysis and cluster analysis/MDS?
FA
- starts with individual variables and reduces these into dimensions or factors
- different ways to run factor analysis - look at the correlation structure and try to reduce it using the factor loadings
- interpret what the dimensions are based on how individual variables load on these factors
cluster analysis/MDS
- start again with individual variables
- then determine which ones go together
difference - we don't extract dimensions; instead we just try to determine which variables in the dataset go together. this is something YOU do. you aren't presented with extracted factors, you're only presented with patterns of how things might go together, and then you decide which go together.
in which discipline would you use cluster analysis
Used in almost every discipline: psychology, neuroscience, biology, etc.
sometimes we need to sort variables together
the criteria we use to do the sorting will affect the outcome of the sorted variables
what is cluster analysis
Humans are good at identifying patterns - e.g., just looking at the residual plot reveals a pattern
very difficult to identify patterns mathematically
CA provides you with information that you can use to identify what the patterns are. human-machine work together
what is a dissimilarity matrix
where the larger the number, the more dissimilar our two objects (e.g., the distance between two cities)
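A minimal sketch of building such a dissimilarity matrix in Python (the three "cities" and their 2-D coordinates are made up for illustration):

```python
import math

# A dissimilarity matrix: the larger the entry, the more dissimilar
# the two objects. Here each entry is the Euclidean distance between
# made-up 2-D coordinates for three "cities".
points = {"A": (0, 0), "B": (3, 4), "C": (6, 8)}
names = list(points)
matrix = [[math.dist(points[p], points[q]) for q in names] for p in names]
# matrix[0][1] is the A-B distance; the diagonal is 0 because each
# object is identical to itself.
```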
what is a similarity matrix and can you give an example of this
where the larger number indicates two objects are more similar e.g., a correlation table
what do we need to be aware of when running cluster analysis with regards to the matrix
whether it is a dissimilarity or similarity matrix
what does cluster analysis actually do in terms of points of data
it puts points that are most similar together and pushes points most dissimilar apart
clusters things together
what different techniques are used to cluster things together
- k means clustering - a non-hierarchical method. you decide at the beginning how many clusters you want, run it, then get a suggested membership of data points to clusters
- hierarchical methods (agglomerative or divisive) - covered below
what is k means clustering?
a non-hierarchical clustering method
- we pick some starting cluster numbers - e.g., I want 3 clusters
- algorithm starts by randomly picking 3 cluster points in your data set
- at each step - the clustering algorithm calculates the distance between each data point and the cluster centers and assigns each data point membership to the nearest cluster
- THEN - each cluster center is moved by a certain algorithm - which calculates whether the move improved the distance measure between all data points and their cluster center
so the goal is an iterative procedure to
- find the cluster centers
- given the goal number of clusters (e.g., 3)
- and find the positions of those cluster centers that minimise the distance of all data points that could be assigned to each cluster
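The iterative procedure above can be sketched in plain Python (toy data; the function and variable names are my own, not from the lecture):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Toy k-means: points is a list of (x, y) tuples, k the desired
    number of clusters. Returns (centers, labels)."""
    rng = random.Random(seed)
    # Start by randomly picking k data points as initial cluster centers.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign each data point to its nearest cluster center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Move each center to the mean of its member points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster emptied out
                new_centers.append(tuple(sum(dim) / len(members)
                                         for dim in zip(*members)))
            else:
                new_centers.append(centers[c])
        # Stop when a further change no longer moves the centers.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Two obvious groups: points near (0, 0) and points near (10, 10).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(pts, k=2)
```

Note how a data point can change cluster membership whenever a center has shifted far enough, exactly as the card above says.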
Explain what’s happening in this k means clustering slide
Well, 3 clusters have been identified; the three cluster centers have been shifted 4 times to find the ideal locations, where the data points are closest to their clusters
with k means clustering what is shifted around the screen - the data points to find those fitting best to the clusters OR the cluster points moving until the data points are closest to them
The data points STAY PUT - it's the cluster center that shifts bit by bit and stops when the data points are closest to the desired number of cluster centers
with k means clustering, if the cluster centroid shifts far enough, is it possible for data points to be assigned a different cluster membership
yes
when does k means clustering stop
when any further change in the cluster center doesn’t reduce the differences anymore.
what does the p-value in cluster analysis tell us
there is no p-value or test statistic of any sort. you are only presented with, e.g., for k means clustering, a suggestion of cluster membership for different data points
describe non-hierarchical cluster analysis
non-hierarchical methods
- where clusters are formed by assigning membership to clusters
- you decide how many clusters you want before the analysis, e.g., k means clustering
- individual data points are assigned to one of the clusters according to some particular criteria
in non-hierarchical cluster analysis how might you decide on the number of clusters?
- have a certain theory
- use previous literature - look at the number of clusters they used
- run it with varying numbers e.g., 2-5 then see which one gives the most reasonable cluster groups
hierarchical methods for cluster analysis: what are the two groups?
- agglomerative method
- divisive method
in any hierarchical method it goes from 1 to many clusters or from many to 1. typically presented as either a dendrogram or an icicle plot. then YOU determine the meaningful number of clusters using a cut off.
in both cases you get a tree diagram (dendrogram) and an icicle plot - helpful in deciding a feasible cut off point
hierarchical cluster analysis: agglomerative methods
different types: single link (nearest neighbour), maximum link (furthest neighbour) or average link (centroid clustering) - they differ in the way they compute the distances
- start by treating each data point as a one-member cluster
- then proceed to put things together - agglomerate clusters
- once a pair of objects have been put together - can't split them up again
- means new clusters are formed based on clusters already created at a previous step
hierarchical cluster analysis: divisive methods
- treat all data points as one giant cluster
- then split things up - once a pair has been separated they can never join again
what is the single link aka nearest neighbour technique
one method of hierarchical agglomerative clustering
- start with each city by itself
- then start amalgamating them
- looks at the data, finds the points with the closest relationship to each other (Durham-Sunderland) and groups these together in a cluster
- then the distance matrix is recalculated and it finds the cities that are next in line closest together (treating the Durham-Sunderland cluster as one)
single link aka nearest neighbour technique
look at the dendrogram and name the cluster groups in order
- durham and sunderland
- exeter and plymouth
- birmingham + (Exeter + plymouth)
horizontal axis is a measure of the relative proximity of the variables - e.g., the relationship between Durham and Sunderland is closer than the relationship between Exeter and Plymouth. knowing the relative distance between cities can help you to create a cut off point (e.g., a cut off point at about 3 on the x axis number line would give us only 1 cluster, but if it was at 24 we would have 3)
single link aka nearest neighbour technique
so after Durham and Sunderland form 1 cluster, SPSS recalibrates and computes a NEW dissimilarity matrix. how does it do this?
when we compare other cities, e.g., Exeter, to these two cities, which distance is used in the dissimilarity matrix?
whichever gives us the smaller value - in this case the Durham-Exeter distance, since Durham is Exeter's closest link to the cluster
This is the matrix used to make the second clustering decision - and we see the smallest value in this table is the exeter-plymouth link
single link aka nearest neighbour technique
let's say in our dissimilarity matrix we're comparing the distance between a cluster of 2 cities and another cluster of 2 cities - how do we decide the distance value to use in the dissimilarity matrix?
(A, B) vs (1, 2)
compare A with 1 and 2, and compare B with 1 and 2 - four pairwise distances
whichever of these gives us the smallest value, we use that in the matrix
keep going until the last 2 clusters form 1
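The whole single-link procedure - merge the pair of clusters whose closest members are nearest, recompute, repeat - can be sketched like this (the city mileages below are invented for illustration, not the real distances):

```python
from itertools import combinations

# Illustrative road distances (made-up numbers for the demo).
cities = ["Durham", "Sunderland", "Birmingham", "Exeter", "Plymouth"]
d = {
    ("Durham", "Sunderland"): 19,
    ("Durham", "Birmingham"): 180, ("Sunderland", "Birmingham"): 190,
    ("Durham", "Exeter"): 320, ("Sunderland", "Exeter"): 330,
    ("Birmingham", "Exeter"): 160,
    ("Durham", "Plymouth"): 360, ("Sunderland", "Plymouth"): 370,
    ("Birmingham", "Plymouth"): 200, ("Exeter", "Plymouth"): 42,
}

def city_dist(a, b):
    return d.get((a, b)) or d[(b, a)]

def single_link(items, pair_dist):
    """Agglomerate one-member clusters; at each step merge the two
    clusters whose *closest* members are nearest (single link)."""
    clusters = [frozenset([i]) for i in items]
    merges = []
    while len(clusters) > 1:
        # Pair of clusters with the smallest nearest-neighbour distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda ab: min(pair_dist(x, y)
                                      for x in ab[0] for y in ab[1]))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(sorted(a | b))  # record each merged cluster
    return merges

merges = single_link(cities, city_dist)
```

With these toy distances the merge order matches the dendrogram on the card: Durham+Sunderland first, then Exeter+Plymouth, then Birmingham joining the Exeter-Plymouth cluster.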
Single link aka nearest neighbour technique
looking at the dendrogram how do we now decide how many clusters to have
is it 2 or 3? it's a judgement call - YOU make the decision
Maximum link (furthest neighbour)
- again, Durham and Sunderland will be the first cluster - because their distance was smallest (19)
- but then the distance computed between this cluster and the other cities (or between clusters) will use the largest distance between their members
- the SMALLEST value in the matrix is still used to determine the next cluster
Average link (Centroid clustering)
Still assigns the two points with the smallest distance from one another together – but distances within the table are based on the average distance between the objects in the clusters.
What’s the difference between single and maximum link hierarchical agglomerative method?
Single link method tends to produce more “chaining” while the maximum link method creates several tightly defined clusters
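The three linkage rules differ only in how they score the distance between two clusters. A sketch (1-D points for simplicity; note the lecture's "centroid clustering" is approximated here by the average of all pairwise distances, a common average-link variant, rather than a literal distance between centroids):

```python
def single_link_dist(A, B):
    """Nearest neighbour: distance between the closest pair."""
    return min(abs(a - b) for a in A for b in B)

def max_link_dist(A, B):
    """Furthest neighbour: distance between the farthest pair."""
    return max(abs(a - b) for a in A for b in B)

def average_link_dist(A, B):
    """Average of all pairwise distances between the clusters."""
    return sum(abs(a - b) for a in A for b in B) / (len(A) * len(B))

A, B = [0, 1], [5, 9]
# pairwise distances: 5, 9, 4, 8
# single link -> 4, maximum link -> 9, average link -> 6.5
```

Because single link only needs one close pair to merge two clusters, it tends to chain; maximum link requires every member to be close, so it produces tighter clusters.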
combining objects and using this as a value in the matrix - why is this bad?
SCALE EFFECTS!!!
because the distance matrix is based on the combined scores, whichever variable is bigger (e.g., percent on a maths test > height in metres) will dominate
So if you were going to run a distance matrix you would have to account for this scaling issue
What are some scaling issues
- you need to account for any scaling issues when comparing the distance between objects in a dissimilarity matrix
- when similar data are rescaled – e.g., scores on a test, one out of 50 and another out of 75 – the raw scores might join child A and B, but the percentage scores join child B and C into one cluster
all this is because we use the Euclidean distance as a measure of proximity. When we combine or rescale scores this measure does not maintain the rank ordering you might have in each variable.
scaling effects - the problem with Euclidean distance is that it doesn't maintain the rank order
how can we fix this?
- rescale data – z transformation: rescale all scores so they have a mean of 0 and SD of 1. Puts all variables on the same scale so when combining them they're all weighted equally.
- In SPSS there are different ways to rescale data – basically just try to put everything on the same scale
- Which way to choose depends on your data – if all variables are equally important then z transformation is the way to go
- But if the raw data is meaningful by itself and it's not so important that the rank ordering is maintained, then you might not want to do any transformation
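A quick sketch of the z transformation and why it matters (the test scores and heights below are made-up values):

```python
import math

def zscores(xs):
    """z transformation: rescale a variable to mean 0 and SD 1."""
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / sd for x in xs]

# Made-up values: percent on a maths test vs height in metres.
maths_pct = [55.0, 60.0, 90.0]
height_m = [1.2, 1.5, 1.1]

# Raw Euclidean distance between child 0 and child 1 is dominated by
# the maths variable (difference of 5); the height difference (0.3)
# barely registers.
raw_01 = math.dist((maths_pct[0], height_m[0]), (maths_pct[1], height_m[1]))

# After z-scoring, both variables contribute on the same scale.
z_maths, z_height = zscores(maths_pct), zscores(height_m)
z_01 = math.dist((z_maths[0], z_height[0]), (z_maths[1], z_height[1]))
```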
With what type of data is it meaningful to compute Euclidean distance. With what is it not?
Interval/ratio scale data. Wouldn’t be meaningful with binary data
How can we do a cluster analysis on binary data
Transform the counts into some measure – these can now be subject to clustering
- In SPSS there are different ways to re-jig this data (a, b, c, d) to get a measure of the similarity between x and y
- Measures differ in whether they treat the absence of a feature as more important than the presence of a feature, or vice versa
Different ways to judge the similarity between 2 binary variables
Simple matching similarity measure (possibly the most common) – (a + d) / (a + b + c + d), where a = both present, b = present in x only, c = present in y only, d = both absent
- Basically it's the total number of matches divided by the total number of measures
Jaccard similarity measure or similarity ratio – a / (a + b + c)
- Basically the same as SMS but with the double negatives (d) removed
Phi
- Binary form of the Pearson product-moment correlation coefficient
After looking at the dendrogram of these birds how can we decide how to cluster them
Go back to your data – ask questions
- Are they woodland/farmland birds
- Are they going up or down in abundance
- Is any species particularly different from the rest
Dendrogram doesn't answer these questions!! It just gives you clues about where in your data you should look
What do you need to be aware of when conducting cluster analysis
- Similarity/dissimilarity matrix - do larger values mean data is more or less similar?
- What decisions to make about the criteria used to cluster the objects (e.g., hierarchical max link)
- Type of data you have (interval/ratio/binary) i.e., if you have binary data you may have to transform that into another measure first like simple matching similarity/phi
- Different techniques provide different solutions!
What kind of matrix is a Euclidean distance matrix?
A dissimilarity matrix