L4 Cluster Analysis Flashcards

Question 1

Q

What does cluster analysis involve?

Answer

A

Grouping the cases identified in an investigation into a smaller number of clusters, based on the characteristics they share

Question 2

Q

Give a human and physical geography application of cluster analysis

Answer

A

Physical - an investigation into plant composition in the Amazon Rainforest may lead to clustering certain plants based on similar characteristics
Human - commercial use of the ‘if you like this, you might like this’ approach. The suggestion is based on a number of cases (people) who have bought both that item and the suggested one. They therefore exhibit similar behavioural characteristics and have been clustered together. The new customer displays similar characteristics, so is invited to potentially join that cluster

Question 3

Q

What does the process of ‘pairing’ involve?

Answer

A

Grouping similar cases together, and then combining the resulting clusters with one another, step by step

Question 4

Q

What is pairing between cases based upon?

Answer

A

Statistical distance between cases

Question 5

Q

What are the 3 methods of classifying statistical distance?

Answer

A

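Euclidean = square root of the sum of squared differences between each score for two observations. Essentially a straight-line distance between 2 individual cases
Manhattan = sum of the absolute differences between each score for two observations
Pearson = a correlation-based measure: how strongly the scores of two observations are correlated across the variables, rather than how far apart they are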
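Worked sketch of the three measures, to make the definitions concrete. The function names and the two example cases are made up for illustration, and the Pearson version assumes the common 1 − r convention for turning a correlation into a distance, which may not be the exact form used in the lecture or in SPSS.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared differences between scores."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences between scores."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b):
    """1 minus Pearson's r (assumed convention, see note above)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1 - cov / math.sqrt(var_a * var_b)

# Two hypothetical cases measured on three variables.
case_1 = [1.0, 2.0, 3.0]
case_2 = [4.0, 6.0, 8.0]

print(euclidean(case_1, case_2))         # straight-line distance
print(manhattan(case_1, case_2))         # city-block distance
print(pearson_distance(case_1, case_2))  # near 0 when scores rise and fall together
```

Note that the two cases are far apart by Euclidean and Manhattan distance but almost identical by the correlation-based measure, because their scores follow the same pattern.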

Question 6

Q

What is Manhattan the best method for?

Answer

A

Handling outliers: because the differences are not squared, extreme values have less impact upon the cluster

Question 7

Q

What is the Pearson method best for?

Answer

A

Handling large datasets

Question 8

Q

What are the 4 linkage procedures for making clusters?

Answer

A

Single
Complete
Centroid
Average

Question 9

Q

What does single linkage clustering involve?

Answer

A

Joining clusters based on the distance between the two most similar (closest) cases, one from each cluster

Question 10

Q

What does complete linkage clustering involve?

Answer

A

Joining clusters based on the distance between the two most different (furthest apart) cases, one from each cluster

Question 11

Q

What does centroid linkage clustering involve?

Answer

A

Joining clusters based on the distance between the central point (centroid) of each cluster

Question 12

Q

What does average linkage clustering involve?

Answer

A

Joining clusters based on the average distance between all pairs of cases, taking one case from each cluster
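Worked sketch of the four linkage procedures, showing how each one measures the distance between the same two clusters. The function names and example clusters are hypothetical, and plain Euclidean distance is assumed throughout (statistical packages may use other measures, such as squared Euclidean, for some methods).

```python
import math
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    """Mean of each variable across the cases in a cluster."""
    return [sum(values) / len(cluster) for values in zip(*cluster)]

def linkage_distance(c1, c2, method):
    # Distances between every pair of cases, one case from each cluster.
    pair_distances = [euclidean(a, b) for a, b in product(c1, c2)]
    if method == "single":
        return min(pair_distances)   # closest pair
    if method == "complete":
        return max(pair_distances)   # furthest pair
    if method == "average":
        return sum(pair_distances) / len(pair_distances)
    if method == "centroid":
        return euclidean(centroid(c1), centroid(c2))
    raise ValueError(f"unknown method: {method}")

# Two hypothetical clusters of two cases each.
cluster_a = [[0.0, 0.0], [0.0, 2.0]]
cluster_b = [[3.0, 0.0], [3.0, 2.0]]

for method in ("single", "complete", "average", "centroid"):
    print(method, linkage_distance(cluster_a, cluster_b, method))
```

At each stage of the clustering, the pair of clusters with the smallest linkage distance is the pair that gets combined.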

Question 13

Q

What are the two forms of a clustering process?

Answer

A

Hierarchical = cases and clusters are combined in order from most similar to least similar until just one cluster is left
Non-hierarchical = the number of desired clusters is specified in advance (most likely due to prior knowledge), so the process of clustering continues until that number of definable clusters has been identified.

Question 14

Q

Describe the process of non-hierarchical clustering.

Answer

A

Specify the number of clusters sought
The clustering process begins, during which cases can continue to move between clusters as the degrees of similarity change
Once the pre-defined number of clusters has been created, the cases are fixed in place
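Worked sketch of the steps above, using k-means as the example. This is an assumption: k-means is one common non-hierarchical method, but the card does not name a specific algorithm. The function name, the choice of the first k cases as starting centres, and the example data are all made up for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(cases, k, max_iterations=100):
    # Assumption: the first k cases act as starting centres, which keeps
    # the sketch deterministic. Real software offers other starting rules.
    centres = [list(c) for c in cases[:k]]
    membership = None
    for _ in range(max_iterations):
        # Assign each case to its nearest centre.
        new_membership = [
            min(range(k), key=lambda j: euclidean(case, centres[j]))
            for case in cases
        ]
        if new_membership == membership:
            break  # no case moved, so memberships are now fixed
        membership = new_membership
        # Recompute each centre as the mean of its current members.
        for j in range(k):
            members = [c for c, m in zip(cases, membership) if m == j]
            if members:
                centres[j] = [sum(v) / len(members) for v in zip(*members)]
    return membership, centres

# Six hypothetical cases on two variables; we ask for 2 clusters in advance.
cases = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5], [1.0, 0.5], [8.5, 9.0]]
membership, centres = k_means(cases, k=2)
print(membership)
print(centres)
```

In this example the second case starts in one cluster and moves to the other once the centres are recalculated, which is the "cases can continue to move" step.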

Question 15

Q

What is a dendrogram?

Answer

A

A graphical representation of the different stages that make up the clustering process.

Question 16

Q

How can a dendrogram be used?

Answer

A

We can visually cut the dendrogram at points where we want clusters to be.

Question 17

Q

When we are visually cutting a dendrogram what do we need to look for and why?

Answer

A

Long lines - these show where a cluster goes a long way before being joined to another cluster. A long line means the two clusters are only combined at a large distance, which suggests there is a large difference (little similarity) between them. We would therefore cut the dendrogram before these two clusters are put together.

Question 18

Q

What is an agglomeration schedule?

Answer

A

A mathematical/tabular method for identifying the different clusters in a dataset.

Question 19

Q

How would we identify a cluster in an agglomeration schedule?

Answer

A

Where there is a relatively large jump in the coefficients between two stages in the agglomeration schedule. The large jump shows that a much greater distance had to be bridged to combine the clusters at the later stage, which suggests there is relatively little similarity between them and it does not make sense to combine them into one cluster.
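Worked sketch of reading an agglomeration schedule. It builds a small schedule by hand and then looks for the largest jump in the coefficients. Single linkage with Euclidean distance is assumed, and the function names and data are hypothetical; the layout is simplified compared with the table a statistics package would print.

```python
import math
from itertools import combinations, product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2, cases):
    # Clusters hold case indices; distance is that of the closest pair.
    return min(euclidean(cases[i], cases[j]) for i, j in product(c1, c2))

def agglomeration_schedule(cases):
    """Merge the two closest clusters at each stage and record the coefficient."""
    clusters = [[i] for i in range(len(cases))]
    schedule = []
    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]], cases),
        )
        coefficient = single_linkage(clusters[a], clusters[b], cases)
        schedule.append((len(schedule) + 1, clusters[a], clusters[b], coefficient))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return schedule

def clusters_suggested(schedule):
    """Clusters remaining if we stop just before the largest jump in coefficients."""
    coefficients = [row[3] for row in schedule]
    jumps = [later - earlier for earlier, later in zip(coefficients, coefficients[1:])]
    stage_before_jump = jumps.index(max(jumps)) + 1  # 1-based stage number
    n_cases = len(schedule) + 1
    return n_cases - stage_before_jump

# Five hypothetical cases: three close together, two far away from them.
cases = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.5], [9.0, 9.0], [9.0, 10.0]]
schedule = agglomeration_schedule(cases)
for stage, first, second, coefficient in schedule:
    print(stage, first, second, round(coefficient, 3))
print(clusters_suggested(schedule))
```

The coefficients stay close to 1 for the first three stages and then jump to about 10 at the final stage, so stopping before that final merge leaves two clusters.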

Question 20

Q

When are agglomeration schedules particularly useful when performing cluster analysis?

Answer

A

When we are dealing with large datasets

Question 21

Q

What are 5 things we should bear in mind when we are performing a cluster analysis?

Answer

A

Logic/subjectivity (real-world application)
No missing data and suitability of data to cluster analysis
Outliers or skewed data can cause mistakes in the pairing process
Absence of groups in the cluster analysis does not mean they do not exist - the cluster analysis may have missed them
Presence of groups does not make them meaningful - cluster analysis may have produced clusters which have little meaning

Question 22

Q

What is important about cluster and factor analysis together?

Answer

A

Factor and cluster analysis are often used together, one after the other: factor analysis first groups related variables into a smaller number of simpler-to-understand factors, and those factors can then be used to cluster similar cases together

Question 23

Q

Why can cluster analysis not proceed if there are missing values?

Answer

A

Because the process calculates the distance between cases across all of the variables, so if some values are missing the distances cannot be calculated properly and the result is unrepresentative
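Small sketch of why a missing value breaks the distance calculation. The missing score is represented as NaN, and removing incomplete cases (listwise deletion) is shown as one possible response; both choices are assumptions for illustration, not something the card prescribes.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_cases(cases):
    """Keep only cases with no missing (NaN) scores."""
    return [c for c in cases if not any(math.isnan(v) for v in c)]

# Three hypothetical cases; the second has a missing score on variable 2.
cases = [[1.0, 2.0, 3.0], [2.0, math.nan, 4.0], [5.0, 6.0, 7.0]]

# The missing score contaminates the whole sum, so the distance is NaN.
distance_with_gap = euclidean(cases[0], cases[1])
print(distance_with_gap)

usable = complete_cases(cases)
print(len(usable))
```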

Question 24

Q

What output in SPSS do we use to determine which cluster cases belong to?

Answer

A

Cross Tabulation Tables