L4 Cluster Analysis Flashcards

1
Q

What does cluster analysis involve?

A

Grouping the cases identified in our investigation into a number of clusters based on the similar characteristics they share

2
Q

Give a human and physical geography application of cluster analysis

A

Physical - an investigation into plant composition in the Amazon Rainforest may lead to clustering certain plants based on similar characteristics
Human - commercial use of the ‘if you like this, you might like this’ approach. This is based on a number of cases (people) who have bought both that item and the suggested one. They therefore exhibit similar behavioural characteristics and have been clustered together. The new customer displays similar characteristics, so is invited to potentially join that cluster

3
Q

What does the process of ‘pairing’ involve?

A

Grouping together the different cases and clusters

4
Q

What is pairing between cases based upon?

A

Statistical distance between cases

5
Q

What are the 3 methods of classifying statistical distance?

A

Euclidean = square root of the sum of squared differences between the scores of two observations. Essentially a straight-line distance between 2 individual cases
Manhattan = sum of the absolute differences between the scores of two observations.
Pearson = a measure based on the Pearson correlation between the scores of two observations
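
The three measures can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the two example cases are made up, and the Pearson version uses one common convention (1 minus the Pearson correlation), which is an assumption rather than something these cards define.

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared differences between scores.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of the absolute differences between scores.
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b):
    # Assumed convention: distance = 1 - Pearson correlation (r).
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1 - cov / math.sqrt(var_a * var_b)

# Two hypothetical cases scored on three variables.
case_a = [1.0, 2.0, 3.0]
case_b = [4.0, 6.0, 3.0]

print(euclidean(case_a, case_b))
print(manhattan(case_a, case_b))
print(pearson_distance(case_a, case_b))
```

Because Manhattan distance does not square the differences, one very large difference on a single variable contributes less to the total than it does under Euclidean distance.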

6
Q

What is Manhattan the best method for?

A

Handling outliers: differences are not squared, which reduces the impact outliers have on the clusters

7
Q

What is the Pearson method best for?

A

Handling large datasets

8
Q

What are the 4 linkage procedures for making clusters?

A
  1. Single
  2. Complete
  3. Centroid
  4. Average
9
Q

What does single linkage clustering involve?

A

Joining clusters based on the distance between the two most similar (closest) cases, one from each cluster

10
Q

What does complete linkage clustering involve?

A

Joining clusters based on the distance between the two most different (furthest apart) cases, one from each cluster

11
Q

What does centroid linkage clustering involve?

A

Joining clusters based on the distance between the central points (centroids) of each cluster

12
Q

What does average linkage clustering involve?

A

Joining clusters based on the average distance between all pairs of cases, one from each cluster
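
As a rough illustration, the four linkage procedures can be compared on two small clusters. The sketch assumes Euclidean distance and treats average linkage as the mean distance over all between-cluster pairs; the function names and the data points are hypothetical.

```python
import math
from itertools import product

def dist(a, b):
    # Euclidean distance between two cases.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    # Mean score on each variable across the cases in a cluster.
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def single_linkage(c1, c2):
    # Distance between the closest pair of cases, one from each cluster.
    return min(dist(a, b) for a, b in product(c1, c2))

def complete_linkage(c1, c2):
    # Distance between the furthest pair of cases, one from each cluster.
    return max(dist(a, b) for a, b in product(c1, c2))

def centroid_linkage(c1, c2):
    # Distance between the central points of the two clusters.
    return dist(centroid(c1), centroid(c2))

def average_linkage(c1, c2):
    # Mean distance over every between-cluster pair of cases.
    return sum(dist(a, b) for a, b in product(c1, c2)) / (len(c1) * len(c2))

# Two hypothetical clusters of two cases each.
cluster_1 = [(0.0, 0.0), (0.0, 2.0)]
cluster_2 = [(3.0, 0.0), (5.0, 0.0)]

for name, fn in [("single", single_linkage), ("complete", complete_linkage),
                 ("centroid", centroid_linkage), ("average", average_linkage)]:
    print(name, fn(cluster_1, cluster_2))
```

Single linkage always gives the smallest of these between-cluster distances and complete linkage the largest, with average linkage falling in between.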

13
Q

What are the two forms of a clustering process?

A
Hierarchical = cases and clusters are combined in order from most similar to least similar until just one cluster is left
Non-hierarchical = the number of desired clusters is specified in advance (most likely due to prior knowledge), so the clustering process continues until that number of definable clusters has been identified.
14
Q

Describe the process of non-hierarchical clustering.

A
  1. Specify the number of clusters sought
  2. The clustering process begins, during which cases can continue to move between the different clusters as their degrees of similarity change
  3. Once the pre-defined number of clusters has been created, the cases are fixed in place
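
The steps above can be sketched with a k-means style procedure, which is one common non-hierarchical method (the cards do not name a specific algorithm, so this choice is an assumption). The function name `k_means`, the use of the first k cases as starting centres, and the example data are all hypothetical.

```python
import math

def dist(a, b):
    # Euclidean distance between two cases.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(cases, k, max_iter=100):
    # Step 1: the number of clusters (k) is specified in advance.
    # Assumption: the first k cases serve as the starting centres.
    centres = [list(c) for c in cases[:k]]
    labels = [None] * len(cases)
    for _ in range(max_iter):
        # Step 2: each case joins its nearest centre, so cases can
        # move between clusters from one pass to the next.
        new_labels = [min(range(k), key=lambda j: dist(case, centres[j]))
                      for case in cases]
        # Step 3: once no case moves, membership is fixed.
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [c for c, lab in zip(cases, labels) if lab == j]
            if members:
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres

# Six hypothetical cases scored on two variables.
cases = [(1, 1), (1, 2), (8, 8), (9, 8), (2, 1), (8, 9)]
labels, centres = k_means(cases, k=2)
print(labels)
```

With this data the second case starts in one cluster and moves to the other on the next pass, which is the kind of movement described in step 2.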
15
Q

What is a dendrogram?

A

A graphical representation of the different stages that make up the clustering process.

16
Q

How can a dendrogram be used?

A

We can visually cut the dendrogram at points where we want clusters to be.

17
Q

When we are visually cutting a dendrogram what do we need to look for and why?

A

Long lines - these show where a cluster goes a long time before being joined with another cluster. Because it goes a long time before being joined, this suggests that there is a large difference between the two clusters, and therefore we would cut the dendrogram before they are put together.

18
Q

What is an agglomeration schedule?

A

A mathematical/tabular method for identifying the different clusters in a dataset.

19
Q

How would we identify a cluster in an agglomeration schedule?

A

Where there is a relatively large jump in the coefficients between two stages in the agglomeration schedule. The large jump represents the large distance that must be bridged to combine the two clusters at that stage, which suggests that there is relatively little similarity between them and that it does not make sense to combine them into one cluster.
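
A minimal sketch of this idea, assuming single linkage on one-variable data: it records the coefficient (merge distance) at each stage and stops before the largest jump. The function names and the data are hypothetical, and real software reports more columns than this.

```python
def agglomeration_schedule(values):
    # Start with every case in its own cluster.
    clusters = [[v] for v in values]
    schedule = []  # coefficient (merge distance) at each stage
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of cases.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        schedule.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return schedule

def suggested_cluster_count(schedule):
    # Find the largest jump in coefficient between consecutive stages.
    jumps = [schedule[s + 1] - schedule[s] for s in range(len(schedule) - 1)]
    biggest = max(range(len(jumps)), key=lambda s: jumps[s])
    # Stop before the stage that follows the jump; each completed stage
    # reduces the number of clusters by one.
    n_cases = len(schedule) + 1
    return n_cases - (biggest + 1)

# Six hypothetical cases measured on one variable.
values = [1.0, 1.5, 2.0, 10.0, 10.5, 20.0]
schedule = agglomeration_schedule(values)
print(schedule)
print(suggested_cluster_count(schedule))
```

The same reasoning applies to a dendrogram: the largest jump in the coefficients corresponds to the longest lines, so both point to the same place to stop combining clusters.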

20
Q

When are agglomeration schedules particularly useful when performing cluster analysis?

A

When we are dealing with large datasets

21
Q

What are 5 things we should bear in mind when we are performing a cluster analysis?

A
  1. Logic/subjectivity (real-world application)
  2. No missing data and suitability of data to cluster analysis
  3. Outliers or skewed data can cause mistakes in the pairing process
  4. Absence of groups in the cluster analysis does not mean they do not exist - the cluster analysis may have missed them
  5. Presence of groups does not make them meaningful - cluster analysis may have produced clusters which have little meaning
22
Q

What is important about cluster and factor analysis together?

A

Factor and cluster analysis are often used together: many scenarios involve variables that factor analysis groups into simpler-to-understand factors, and those factors can then quite easily be used to identify similar cases that can be clustered together

23
Q

Why can cluster analysis not proceed if there are missing values?

A

Because the process calculates the distance between cases using their scores on each variable, so if some scores are missing the distances cannot be calculated properly and the result is unrepresentative
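
A small sketch of the problem, assuming missing scores are stored as NaN: one missing score makes the whole distance undefined. Keeping only complete cases is shown as one possible response, not as the required one, and the helper name `complete_cases` is hypothetical.

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared differences between scores.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_cases(cases):
    # Keep only the cases with a score on every variable.
    return [c for c in cases if not any(math.isnan(v) for v in c)]

# Three hypothetical cases; the second has a missing score (NaN).
cases = [(1.0, 2.0), (3.0, math.nan), (4.0, 6.0)]

# One missing score makes the whole distance undefined.
print(euclidean(cases[0], cases[1]))

usable = complete_cases(cases)
print(euclidean(usable[0], usable[1]))
```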

24
Q

What output in SPSS do we use to determine which cluster cases belong to?

A

Cross Tabulation Tables