Lecture 11 - Cluster Analysis Flashcards
Why do we use cluster analysis?
- data often contain substantial individual differences
- for non-unimodal data!
- useful to summarise data into discrete groups
What are the 3 types of cluster analysis?
- hierarchical
- k-means
- two-step
What type of technique is cluster analysis?
exploratory, looks for latent classes of cases/variables
Why don’t we use correlations? What do we use instead?
- correlation assesses similar variation, not similarity among the scores themselves
- just because two variables are highly correlated does not mean that they are the closest variables to one another
- can have perfect correlation, yet not be close values
SO… we use distance
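A small sketch of the point above, using made-up scores (the lists `a`, `b`, `c` and both helper functions are illustrative assumptions, not from the lecture). `b` follows exactly the same pattern as `a` but sits 100 points higher, while `c` has a different pattern but nearly the same values.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    """Euclidean distance between two equal-length score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical scores, chosen only to illustrate the contrast.
a = [1, 2, 3, 4]
b = [101, 102, 103, 104]  # same pattern as a, shifted up by 100
c = [2, 1, 4, 3]          # different pattern, but values close to a

r_ab, d_ab = pearson(a, b), euclidean(a, b)
r_ac, d_ac = pearson(a, c), euclidean(a, c)

print(round(r_ab, 3), d_ab)  # perfect correlation, distance of 200.0
print(round(r_ac, 3), d_ac)  # weaker correlation, distance of 2.0
```

Here `a` and `b` correlate perfectly yet are far apart, whereas `a` and `c` correlate less strongly yet are the closest pair.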
What are the 5 distance measures?
- euclidean
- block
- minkowski-r
- squared euclidean
- power
Block (city block) distance
sum of |x - y|
Squared Euclidean
sum of (x - y)^2
Euclidean distance
square root of sum of (x - y)^2
Theoretically, what is the difference between Euclidean and city block distances?
- euc: hypotenuse
- city: sum of sides (how a taxi driver would get there)
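The distance cards above can be checked with a minimal sketch on a 3-4-5 right triangle. The function names and the points `p` and `q` are assumptions for illustration. The Minkowski form shown is the standard one, (sum of |x - y|^r)^(1/r), which reduces to block distance at r = 1 and Euclidean at r = 2.

```python
import math

def block(x, y):
    """City block distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def squared_euclidean(x, y):
    """Sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(squared_euclidean(x, y))

def minkowski(x, y, r):
    """Minkowski-r distance: r = 1 gives block, r = 2 gives Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

# Hypothetical points: the sides of the triangle are 3 and 4.
p = [0, 0]
q = [3, 4]

print(block(p, q))              # 7   (sum of sides, the taxi route)
print(squared_euclidean(p, q))  # 25
print(euclidean(p, q))          # 5.0 (the hypotenuse)
```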
What is a proximity matrix?
- rows: variables
- columns: variables
- cells: distance between variables, using whichever distance measure you chose
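A rough sketch of building a proximity matrix, assuming Euclidean distance and three made-up variables (`v1`, `v2`, `v3` and the helper names are hypothetical). Any other distance function could be passed in instead.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def proximity_matrix(scores, dist):
    """Return (labels, matrix) where matrix[i][j] is the distance
    between the i-th and j-th variable."""
    labels = list(scores)
    matrix = [[dist(scores[a], scores[b]) for b in labels] for a in labels]
    return labels, matrix

# Hypothetical data: each variable holds scores from the same three cases.
scores = {
    "v1": [1, 2, 3],
    "v2": [2, 2, 4],
    "v3": [9, 8, 7],
}

labels, matrix = proximity_matrix(scores, euclidean)
for label, row in zip(labels, matrix):
    print(label, [round(d, 2) for d in row])
```

The diagonal is zero (each variable is at distance 0 from itself) and the matrix is symmetric, so only one triangle needs to be read.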
What are the steps in agglomerative hierarchical clustering methods?
- start with proximity
- combine 2 closest
- combine again
- final step: one cluster
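The steps above can be sketched as a short loop. This is a minimal illustration only, assuming one-dimensional scores, absolute difference as the distance, and single linkage (nearest neighbour) to compare clusters. The data and function names are made up, and real software uses more efficient algorithms.

```python
def single_linkage(c1, c2):
    """Distance between two clusters = distance between their closest members."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(points, linkage):
    """Merge the two closest clusters repeatedly until one cluster remains.
    Returns the merge history as (distance, merged_cluster) tuples."""
    clusters = [[p] for p in points]  # start: every case is its own cluster
    history = []
    while len(clusters) > 1:
        # Find the closest pair of clusters (ties go to the first pair found).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with the new combined cluster.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append((d, merged))
    return history

# Hypothetical scores for five cases.
points = [1.0, 2.0, 6.0, 7.0, 20.0]
history = agglomerate(points, single_linkage)
for distance, cluster in history:
    print(distance, sorted(cluster))
```

The merge distances (1, 1, 4, 13 here) are what a dendrogram plots. A large jump between successive merges suggests a place to stop.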
What do you need to decide for agglomerative hierarchical clustering?
- what distance measure to use
- when to stop (how many clusters)
What helps you decide how many clusters to use? What is important to notice about this?
the dendrogram
- the Y axis is NOT ordered
What are the different methods for measuring distances between clusters?
- nearest neighbour (single linkage)
- furthest neighbour (complete linkage)
- average linkage (between or within)
- Ward’s method (variance-based, uses SS between to minimise SS within)
- centroid method (distance b/w means)
- median method (similar to centroid, but small clusters weighted same as large)
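A sketch of how four of these methods score the same pair of clusters, assuming Euclidean distance between points. The two clusters and all function names are made up for illustration. Only the between-groups version of average linkage is shown, and Ward's and the median method are left out.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    """Nearest neighbour: smallest distance between any two members."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Furthest neighbour: largest distance between any two members."""
    return max(euclidean(p, q) for p in c1 for q in c2)

def average_between(c1, c2):
    """Average linkage (between groups): mean of all cross-cluster distances."""
    ds = [euclidean(p, q) for p in c1 for q in c2]
    return sum(ds) / len(ds)

def centroid_distance(c1, c2):
    """Centroid method: distance between the cluster means."""
    mean1 = [sum(col) / len(c1) for col in zip(*c1)]
    mean2 = [sum(col) / len(c2) for col in zip(*c2)]
    return euclidean(mean1, mean2)

# Hypothetical clusters of two points each.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 2.0)]

d_single = single_linkage(cluster_a, cluster_b)      # 3.0
d_complete = complete_linkage(cluster_a, cluster_b)  # sqrt(13), about 3.61
d_average = average_between(cluster_a, cluster_b)    # between the two above
d_centroid = centroid_distance(cluster_a, cluster_b) # 3.0
print(d_single, d_complete, d_average, d_centroid)
```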
What are the general effects of the different distance methods?
- single: large clusters
- complete: tight
- average between: compromise b/w large and tight
- average within: similar to complete
- Ward’s: combines clusters with small/equal no. of data points
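One way to see why Ward's method tends to merge small clusters: the cost of a merge is the increase in within-cluster sum of squares (SS within), which works out to n1*n2/(n1+n2) times the squared distance between the cluster means. A minimal sketch with made-up one-dimensional clusters (all names and data are assumptions):

```python
def within_ss(xs):
    """Sum of squared deviations from the cluster mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def ward_cost(c1, c2):
    """Increase in total SS within caused by merging c1 and c2."""
    return within_ss(c1 + c2) - within_ss(c1) - within_ss(c2)

# Two hypothetical candidate merges. In both, the cluster means are 4 apart.
small_a, small_b = [0.0], [4.0]                        # 1 case each
large_a, large_b = [-1.0, 0.0, 1.0], [3.0, 4.0, 5.0]   # 3 cases each

cost_small = ward_cost(small_a, small_b)  # 1*1/2 * 16 = 8.0
cost_large = ward_cost(large_a, large_b)  # 3*3/6 * 16 = 24.0
print(cost_small, cost_large)
```

With the means equally far apart, the merge of the two small clusters is cheaper, so Ward's method would combine them first.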