Lecture 11 - Cluster Analysis Flashcards
Why do we use cluster analysis?
- often have substantial individual differences in your data
- for non-unimodal data!
- useful to summarise data into discrete groups
What are the 3 types of cluster analysis?
- hierarchical
- k-means
- two-step
What type of technique is cluster analysis?
exploratory, looks for latent classes of cases/variables
Why don’t we use correlations? What do we use instead?
- correlation assesses similar patterns of variation, not similarity among the scores themselves
- just because two variables are highly correlated does not mean that they are the closest variables to one another
- can have perfect correlation, yet not be close values
SO… we use distance
What are the 5 distance measures?
- euclidean
- block
- minkowski-r
- squared euclidean
- power
Block (city block) distance
sum of |x - y|
Squared Euclidean
sum of (x - y)^2
Euclidean distance
square root of sum of (x - y)^2
Theoretically, what is the difference between Euclidean and city block distances?
- euc: hypotenuse
- city: sum of sides (how a taxi driver would get there)
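The three distance measures above can be sketched in a few lines of Python (the score profiles are made up for illustration):

```python
import math

# two toy score profiles for a pair of cases
x = [2.0, 4.0, 6.0]
y = [5.0, 8.0, 6.0]

# city block: sum of absolute differences (how a taxi driver would get there)
block = sum(abs(a - b) for a, b in zip(x, y))

# squared Euclidean: sum of squared differences
sq_euc = sum((a - b) ** 2 for a, b in zip(x, y))

# Euclidean: square root of the squared Euclidean (the hypotenuse)
euc = math.sqrt(sq_euc)

print(block, sq_euc, euc)  # 7.0 25.0 5.0
```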
What is a proximity matrix?
- rows: variables
- columns: variables
- cells: distance between variables, using whichever distance measure you chose
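A minimal sketch of building a proximity matrix, here with Euclidean distance on made-up 2-score cases (names and values are illustrative only):

```python
import math

# toy data: each case has two scores
cases = {"A": [1.0, 2.0], "B": [4.0, 6.0], "C": [1.0, 3.0]}

def euclidean(x, y):
    # square root of sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# proximity matrix: rows x columns of pairwise distances
names = list(cases)
proximity = {(i, j): euclidean(cases[i], cases[j]) for i in names for j in names}

print(proximity[("A", "B")])  # 5.0 -- matrix is symmetric, diagonal is 0
```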
What are the steps in agglomerative hierarchical clustering methods?
- start with proximity
- combine 2 closest
- combine again
- final step: one cluster
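The agglomerative steps above can be sketched in pure Python using nearest-neighbour (single linkage) distances between clusters; the 1-D data and the single-linkage choice are illustrative, not from the lecture:

```python
import math

# toy 1-D cases: two tight groups (1,2 and 9,10) plus an outlier (25)
points = [[1.0], [2.0], [9.0], [10.0], [25.0]]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# step 1: start with every case in its own cluster
clusters = [[p] for p in points]
merge_history = []

# steps 2+: repeatedly combine the 2 closest clusters until one remains
while len(clusters) > 1:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            # single linkage: distance between nearest members
            d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    d, i, j = best
    merge_history.append(d)  # distance at which clusters were joined
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

print(merge_history)  # [1.0, 1.0, 7.0, 15.0]
```

The recorded merge distances jump sharply when unlike clusters are forced together, which is the idea behind stopping when the agglomeration coefficient becomes large.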
What do you need to decide for agglomerative hierarchical clustering?
- what distance measure to use
- when to stop (how many clusters)
What helps you decide how many clusters to use? What is important to notice about this?
the dendrogram
- the Y axis is NOT ordered
What are the different methods for distances?
- nearest neighbour (single linkage)
- furthest neighbour (complete linkage)
- average linkage (between or within)
- Ward’s method (variance-based, uses SS between to minimise SS within)
- centroid method (distance b/w means)
- median method (similar to centroid, but small clusters weighted same as large)
What are the general effects of the different distance methods?
- single: large clusters
- complete: tight
- average between: compromise b/w large and tight
- average within: similar to complete
- Ward’s: combines clusters with small/equal no. of data points
Which combination methods should you choose?
- do not do single
- average for well-behaved data
- complete more stable than single
- Ward’s and average are generally pretty good (when no outliers)
- no method is always superior
How many clusters should you do?
- a matter of interpretation and choice
- when agglomeration coefficient becomes large
What is the agglomeration coefficient?
how alike the two clusters being joined are
What are the advantages and disadvantages of clustering?
- simplest way of looking at latent classes
- BUT there are more sophisticated methods
What is different about k-means clustering than hierarchical?
- predefine no. of clusters
- iterative process; reduces variation within each cluster
- only clusters cases (not variables)
What are the steps on K-means clustering?
- define no. of clusters
- set initial cluster means
- find squared Euc distance from each case to the mean
- allocate object to closest cluster
- find new distance
- allocate again
- stop: when no change
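The k-means steps above can be sketched in pure Python on toy 1-D data (the scores, k = 2, and the initial means are all made up for illustration):

```python
# step 1-2: define no. of clusters and set initial cluster means
cases = [1.0, 2.0, 9.0, 10.0]
means = [0.0, 5.0]

while True:
    # step 3-4: allocate each case to the cluster whose mean is
    # closest by squared Euclidean distance
    groups = [[], []]
    for c in cases:
        k = min(range(2), key=lambda i: (c - means[i]) ** 2)
        groups[k].append(c)
    # step 5-6: recompute means and reallocate; stop when no change
    new_means = [sum(g) / len(g) for g in groups]
    if new_means == means:
        break
    means = new_means

print(means)  # [1.5, 9.5]
```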
What are the steps in 2-step clustering?
- step 1: cases grouped in many small sub-clusters (by using cluster trees)
- step 2: sub-clusters are clustered using a hierarchical agglomerative procedure
What are the advantages of 2-step clustering?
- combines k-means and hierarchical
- handles outliers
- allows for categorical and continuous
- fancy output
- can either choose no. of clusters or it will do it for you
What are the disadvantages of 2-step clustering?
- only cluster cases
- cluster membership can depend on order of cases in data file (only an issue with small sample sizes)
How is similarity shown in clusters?
- more similar = same cluster
- not similar = different cluster
What is the formula for distance?
Dxy = (Σ|x-y|^p)^(1/q)
- Block: p = q = 1
- Euclidean: p = q = 2
- Squared Euclidean: p = 2, q = 1
- Minkowski-r: p = q = r
- Power: p ≠ q
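The general formula can be written as one Python function, with each named distance as a choice of p and q (toy points chosen for illustration):

```python
def distance(x, y, p, q):
    # Dxy = (sum of |x - y|^p) ^ (1/q)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / q)

x, y = [0.0, 0.0], [3.0, 4.0]
print(distance(x, y, 1, 1))  # block: 7.0
print(distance(x, y, 2, 2))  # Euclidean: 5.0
print(distance(x, y, 2, 1))  # squared Euclidean: 25.0
```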
How do you know what the clusters each mean/represent?
- up to you to decide
- similar to EFA, you have to label the latent groups
When do you actually use the distance measures?
when you have combined a cluster and you are trying to find the new distance b/w that cluster and the other variables for the next step
What is good about k-means clustering?
- produces a reasonable number of cases in each cluster
- often used in market research