Lecture 11 - Cluster Analysis Flashcards
Why do we use cluster analysis?
- often have substantial individual differences in your data
- for non-unimodal data!
- useful to summarise data into discrete groups
What are the 3 types of cluster analysis?
- hierarchical
- k-means
- two-step
What type of technique is cluster analysis?
exploratory, looks for latent classes of cases/variables
Why don’t we use correlations? What do we use instead?
- correlation assesses similar patterns of variation, not similarity among the scores themselves
- just because two variables are highly correlated does not mean that they are the closest variables to one another
- can have perfect correlation, yet not be close values
SO… we use distance
What are the 5 distance measures?
- euclidean
- block
- minkowski-r
- squared euclidean
- power
Block (city block) distance
sum of |x - y|
Squared Euclidean
sum of (x - y)^2
Euclidean distance
square root of sum of (x - y)^2
Theoretically, what is the difference between Euclidean and city block distances?
- euc: hypotenuse
- city: sum of sides (how a taxi driver would get there)
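The three distance measures above can be sketched in a few lines of Python (the score profiles are made up for illustration):

```python
import math

# two toy score profiles for a pair of cases
x = [2.0, 4.0, 6.0]
y = [5.0, 8.0, 6.0]

# city block: sum of absolute differences (how a taxi driver would get there)
block = sum(abs(a - b) for a, b in zip(x, y))

# squared Euclidean: sum of squared differences
sq_euc = sum((a - b) ** 2 for a, b in zip(x, y))

# Euclidean: square root of the squared Euclidean (the hypotenuse)
euc = math.sqrt(sq_euc)

print(block, sq_euc, euc)  # 7.0 25.0 5.0
```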
What is a proximity matrix?
- rows: variables
- columns: variables
- cells: distance between variables, using whichever distance measure you chose
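A minimal sketch of building a proximity matrix, here with Euclidean distance on made-up 2-score cases (names and values are illustrative only):

```python
import math

# toy data: each case has two scores
cases = {"A": [1.0, 2.0], "B": [4.0, 6.0], "C": [1.0, 3.0]}

def euclidean(x, y):
    # square root of sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# proximity matrix: rows x columns of pairwise distances
names = list(cases)
proximity = {(i, j): euclidean(cases[i], cases[j]) for i in names for j in names}

print(proximity[("A", "B")])  # 5.0 -- matrix is symmetric, diagonal is 0
```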
What are the steps in agglomerative hierarchical clustering methods?
- start with proximity
- combine 2 closest
- combine again
- final step: one cluster
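The agglomerative steps above can be sketched in pure Python using nearest-neighbour (single linkage) distances between clusters; the 1-D data and the single-linkage choice are illustrative, not from the lecture:

```python
import math

# toy 1-D cases: two tight groups (1,2 and 9,10) plus an outlier (25)
points = [[1.0], [2.0], [9.0], [10.0], [25.0]]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# step 1: start with every case in its own cluster
clusters = [[p] for p in points]
merge_history = []

# steps 2+: repeatedly combine the 2 closest clusters until one remains
while len(clusters) > 1:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            # single linkage: distance between nearest members
            d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    d, i, j = best
    merge_history.append(d)  # distance at which clusters were joined
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

print(merge_history)  # [1.0, 1.0, 7.0, 15.0]
```

The recorded merge distances jump sharply when unlike clusters are forced together, which is the idea behind stopping when the agglomeration coefficient becomes large.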
What do you need to decide for agglomerative hierarchical clustering?
- what distance measure to use
- when to stop (how many clusters)
What helps you decide how many clusters to use? What is important to notice about this?
the dendrogram
- the Y axis is NOT ordered
What are the different methods for distances?
- nearest neighbour (single linkage)
- furthest neighbour (complete linkage)
- average linkage (between or within)
- Ward’s method (variance-based, uses SS between to minimise SS within)
- centroid method (distance b/w means)
- median method (similar to centroid, but small clusters weighted same as large)
What are the general effects of the different distance methods?
- single: large clusters
- complete: tight
- average between: compromise b/w large and tight
- average within: similar to complete
- Ward’s: combines clusters with small/equal no. of data points
Which combination methods should you choose?
- do not do single
- average for well-behaved data
- complete more stable than single
- Ward’s and average are generally pretty good (when no outliers)
- no method is always superior
How many clusters should you do?
- a matter of interpretation and choice
- when agglomeration coefficient becomes large
What is the agglomeration coefficient?
how alike the two clusters being joined are
What are the advantages and disadvantages of clustering?
- simplest way of looking at latent classes
- BUT there are more sophisticated methods
What is different about k-means clustering than hierarchical?
- predefine no. of clusters
- iterative process; reduces variation within each cluster
- only clusters cases (not variables)
What are the steps on K-means clustering?
- define no. of clusters
- set initial cluster means
- find squared Euc distance from each case to the mean
- allocate object to closest cluster
- find new distance
- allocate again
- stop: when no change
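The k-means steps above can be sketched in pure Python on toy 1-D data (the scores, k = 2, and the initial means are all made up for illustration):

```python
# step 1-2: define no. of clusters and set initial cluster means
cases = [1.0, 2.0, 9.0, 10.0]
means = [0.0, 5.0]

while True:
    # step 3-4: allocate each case to the cluster whose mean is
    # closest by squared Euclidean distance
    groups = [[], []]
    for c in cases:
        k = min(range(2), key=lambda i: (c - means[i]) ** 2)
        groups[k].append(c)
    # step 5-6: recompute means and reallocate; stop when no change
    new_means = [sum(g) / len(g) for g in groups]
    if new_means == means:
        break
    means = new_means

print(means)  # [1.5, 9.5]
```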
What are the steps in 2-step clustering?
- step 1: cases grouped in many small sub-clusters (by using cluster trees)
- step 2: sub-clusters are clustered using a hierarchical agglomerative procedure
What are the advantages of 2-step clustering?
- combines k-means and hierarchical
- handles outliers
- allows for categorical and continuous
- fancy output
- can either choose no. of clusters or it will do it for you
What are the disadvantages of 2-step clustering?
- only cluster cases
- cluster membership can depend on order of cases in data file (only an issue with small sample sizes)
How is similarity shown in clusters?
- more similar = same cluster
- not similar = different cluster
What is the formula for distance?
Dxy = (Σ|x-y|^p)^(1/q)
- Block: p = q = 1
- Euclidean: p = q = 2
- Squared Euclidean: p = 2, q = 1
- Minkowski-r: p = q = r
- Power: p ≠ q
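The general formula can be written as one Python function, with each named distance as a choice of p and q (toy points chosen for illustration):

```python
def distance(x, y, p, q):
    # Dxy = (sum of |x - y|^p) ^ (1/q)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / q)

x, y = [0.0, 0.0], [3.0, 4.0]
print(distance(x, y, 1, 1))  # block: 7.0
print(distance(x, y, 2, 2))  # Euclidean: 5.0
print(distance(x, y, 2, 1))  # squared Euclidean: 25.0
```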
How do you know what the clusters each mean/represent?
- up to you to decide
- similar to EFA, you have to label the latent groups
When do you actually use the distance measures?
when you have combined a cluster and you are trying to find the new distance b/w that cluster and the other variables for the next step
What is good about k-means clustering?
- produces a reasonable number of cases in each cluster
- often used in market research