Lecture 11 - Cluster Analysis Flashcards

1
Q

Why do we use cluster analysis?

A
  • data often contain substantial individual differences
  • suited to non-unimodal data
  • useful to summarise data into discrete groups
2
Q

What are the 3 types of cluster analysis?

A
  • hierarchical
  • k-means
  • two-step
3
Q

What type of technique is cluster analysis?

A

exploratory, looks for latent classes of cases/variables

4
Q

Why don’t we use correlations? What do we use instead?

A
  • correlation assesses similar variation, not similarity among the scores themselves
  • just because two variables are highly correlated does not mean they are the closest variables to one another
  • can have a perfect correlation, yet values that are not close (see the sketch below)

SO… we use distance
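A quick illustrative sketch in Python (not from the lecture; numpy assumed, data made up) showing a perfect correlation between values that are far apart:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = x + 100                           # y is just x shifted up by 100
    print(np.corrcoef(x, y)[0, 1])        # correlation = 1.0 (perfect)
    print(np.sqrt(np.sum((x - y) ** 2)))  # Euclidean distance = 200.0 (far apart)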

5
Q

What are the 5 distance measures?

A
  • euclidean
  • block
  • minkowski-r
  • squared euclidean
  • power
6
Q

Block (city block) distance

A

sum of absolute differences: Σ|x − y|

7
Q

Squared Euclidean

A

sum of squared differences: Σ(x − y)^2

8
Q

Euclidean distance

A

square root of the sum of squared differences: √(Σ(x − y)^2)
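A minimal Python sketch (numpy assumed, values made up) computing the three distances above on the same pair of points:

    import numpy as np

    x, y = np.array([1.0, 4.0]), np.array([4.0, 8.0])
    block = np.sum(np.abs(x - y))   # city block: |1-4| + |4-8| = 7
    sq_euc = np.sum((x - y) ** 2)   # squared Euclidean: 9 + 16 = 25
    euc = np.sqrt(sq_euc)           # Euclidean: 5 (a 3-4-5 triangle)
    print(block, sq_euc, euc)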

9
Q

Theoretically, what is the difference between Euclidean and city block distances?

A
  • euc: hypotenuse
  • city: sum of the sides (how a taxi driver would get there)

10
Q

What is a proximity matrix?

A
  • rows: variables
  • columns: variables
  • cells: distance between variables, using whichever distance measure you chose
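A sketch of building one with scipy (the scores are made up; each row is one variable's scores across cases):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    scores = np.array([[1.0, 2.0, 3.0],
                       [1.5, 2.5, 3.5],
                       [8.0, 8.0, 9.0]])
    prox = squareform(pdist(scores, metric='euclidean'))
    print(prox)   # cell [i, j] = distance between variable i and variable j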
11
Q

What are the steps in agglomerative hierarchical clustering methods?

A
  • start with proximity
  • combine 2 closest
  • combine again
  • final step: one cluster
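These steps are what scipy's linkage function carries out; a minimal sketch on made-up data:

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    data = np.array([[1.0], [2.0], [9.0], [10.0]])   # four cases, one variable
    Z = linkage(data, method='average', metric='euclidean')
    # each row of Z records one merge: [cluster i, cluster j, distance, new size]
    # the last row is the final step, where everything becomes one cluster
    print(Z)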
12
Q

What do you need to decide for agglomerative hierarchical clustering?

A
  • what distance measure to use
  • when to stop (how many clusters)

13
Q

What helps you decide how many clusters to use? What is important to notice about this?

A

the dendrogram

  • the Y axis is NOT ordered
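A sketch of drawing a dendrogram with scipy and matplotlib (data made up):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    data = np.array([[1.0], [2.0], [5.0], [9.0], [10.0]])
    dendrogram(linkage(data, method='average'))
    # the case axis is arranged for a tidy plot, not by any meaningful ranking
    plt.show()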

14
Q

What are the different methods for distances?

A
  • nearest neighbour (single linkage)
  • furthest neighbour (complete linkage)
  • average linkage (between or within)
  • Ward’s method (variance-based, uses SS between to minimise SS within)
  • centroid method (distance b/w means)
  • median method (similar to centroid, but small clusters weighted same as large)
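These all have counterparts in scipy's linkage; a rough sketch (made-up data; scipy's 'average' is between-groups average linkage) showing that the merge heights depend on the method chosen:

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    data = np.array([[1.0], [1.3], [4.0], [4.2], [9.0]])
    for method in ['single', 'complete', 'average', 'ward', 'centroid', 'median']:
        Z = linkage(data, method=method)
        print(method, Z[-1, 2])   # height of the final merge varies with the method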
15
Q

What are the general effects of the different distance methods?

A
  • single: large clusters
  • complete: tight
  • average between: compromise b/w large and tight
  • average within: similar to complete
  • Ward’s: combines clusters with small/equal no. of data points
16
Q

Which combination methods should you choose?

A
  • do not use single
  • average for well-behaved data
  • complete more stable than single
  • Ward’s and average are generally pretty good (when no outliers)
  • no method is always superior
17
Q

How many clusters should you do?

A
  • a matter of interpretation and choice
  • stop when the agglomeration coefficient becomes large
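In scipy the coefficient for each merge sits in column 2 of the linkage matrix; a sketch of looking for the jump (data made up):

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    data = np.array([[1.0], [1.2], [5.0], [5.1], [9.0]])
    Z = linkage(data, method='ward')
    print(Z[:, 2])   # a large jump means two unalike clusters were joined,
                     # so stop at the solution just before that merge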

18
Q

What is the agglomeration coefficient?

A

how alike the two clusters being joined are

19
Q

What are the advantages and disadvantages of clustering?

A
  • simplest way of looking at latent classes
  • BUT there are more sophisticated methods

20
Q

What is different about k-means clustering than hierarchical?

A
  • predefine no. of clusters
  • iterative process that reduces variation within each cluster
  • only clusters cases (not variables)
21
Q

What are the steps in K-means clustering?

A
  • define no. of clusters
  • set initial cluster means
  • find squared Euc distance from each case to the mean
  • allocate object to closest cluster
  • recompute the cluster means and find the new distances
  • allocate again
  • stop: when no change
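A minimal Python sketch of exactly those steps (numpy assumed; a real analysis would use a packaged routine):

    import numpy as np

    def kmeans(data, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        means = data[rng.choice(len(data), k, replace=False)]   # initial cluster means
        for _ in range(n_iter):
            # squared Euclidean distance from every case to every cluster mean
            d = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)                           # allocate to closest cluster
            new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                            else means[j] for j in range(k)])   # recompute the means
            if np.allclose(new, means):                         # stop: no change
                break
            means = new
        return labels, means

    data = np.array([[1.0], [1.5], [8.0], [8.5], [9.0]])   # made-up cases
    labels, means = kmeans(data, k=2)
    print(labels)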
22
Q

What are the steps in 2-step clustering?

A
  • step 1: cases grouped in many small sub-clusters (by using cluster trees)
  • step 2: sub-clusters are clustered using a hierarchical agglomerative procedure
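SPSS's two-step procedure itself isn't in the common Python libraries, but scikit-learn's Birch is close in spirit: it grows a cluster (CF) tree of many small sub-clusters, then merges them with agglomerative clustering. A sketch on made-up continuous data (Birch, unlike SPSS two-step, does not take categorical variables):

    import numpy as np
    from sklearn.cluster import Birch

    data = np.random.default_rng(0).normal(size=(500, 2))   # made-up continuous data
    # step 1: the CF tree groups cases into many small sub-clusters;
    # step 2: the sub-clusters are merged hierarchically into n_clusters
    model = Birch(threshold=0.5, n_clusters=3).fit(data)
    print(model.labels_[:10])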
23
Q

What are the advantages of 2-step clustering?

A
  • combines k-means and hierarchical
  • handles outliers
  • allows for categorical and continuous variables
  • fancy output
  • can either choose no. of clusters or it will do it for you
24
Q

What are the disadvantages of 2-step clustering?

A
  • only clusters cases
  • cluster membership can depend on the order of cases in the data file (only an issue for small sample sizes)

25
Q

How is similarity shown in clusters?

A
  • more similar = same cluster
  • not similar = different cluster

26
Q

What is the formula for distance?

A

Dxy = (Σ|x − y|^p)^(1/q)

Block: p = q = 1
Euc: p = q = 2
Squared Euc: p = 2, q = 1
Minkowski-r: p = q = r
Power: p ≠ q
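A sketch that reproduces the special cases from the one formula (numpy assumed, values made up):

    import numpy as np

    def dist(x, y, p, q):
        return np.sum(np.abs(x - y) ** p) ** (1.0 / q)   # Dxy = (Σ|x − y|^p)^(1/q)

    x, y = np.array([1.0, 4.0]), np.array([4.0, 8.0])
    print(dist(x, y, 1, 1))   # block = 7
    print(dist(x, y, 2, 2))   # Euclidean = 5
    print(dist(x, y, 2, 1))   # squared Euclidean = 25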
27
Q

How do you know what the clusters each mean/represent?

A
  • up to you to decide
  • similar to EFA, where you have to label the latent groups

28
Q

When do you actually use the distance measures?

A

after you have combined two clusters and are trying to find the new distance b/w that merged cluster and the other variables/clusters for the next step

29
Q

What is good about k-means clustering?

A
  • produces reasonable numbers in clusters
  • often used in market research