Cluster Analysis Flashcards

1
Q

Cluster analysis is based on and involves

A

Cluster analysis is based on the concept of similarity. Groups are formed by pairing individual cases within a dataset according to how similar they are across a series of two or more scales or measures.

2
Q

Cluster analysis techniques

A

Allow the researcher to examine how cases in the dataset are related to each other across a range of variables

3
Q

Cluster analysis - distance

A

Observations are related and paired to each other on the basis of distances, which are defined by the differences between the scores of one observation and the corresponding scores of another. Distance (D) is based on the notion of an n x v data space, D = (n, v), where n = number of observations and v = number of variables. Each of the n observations is a point in v-dimensional space.

4
Q

Euclidean distance

A

Square root of the sum of the squared differences between corresponding scores for two observations.
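
As a minimal sketch of this definition, assuming each observation is a plain list of numeric scores (the function name and the two example cases are hypothetical):

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between paired scores."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical observations scored on three variables
case_1 = [1.0, 2.0, 3.0]
case_2 = [4.0, 6.0, 3.0]
d = euclidean_distance(case_1, case_2)  # sqrt(9 + 16 + 0) = 5.0
```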

5
Q

Manhattan distance

A

Sum of the absolute differences between corresponding scores for two observations.
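
A minimal sketch of this definition, again assuming each observation is a list of numeric scores (the function name and example cases are hypothetical):

```python
def manhattan_distance(a, b):
    """Sum of the absolute differences between paired scores."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Two hypothetical observations scored on three variables
case_1 = [1.0, 2.0, 3.0]
case_2 = [4.0, 6.0, 3.0]
d = manhattan_distance(case_1, case_2)  # 3 + 4 + 0 = 7.0
```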

6
Q

Pearson distance

A

Square root of the sum of the squared differences between corresponding scores for two observations, with each squared difference divided by the variance of that variable.
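
A minimal sketch, assuming that "their variance" means the sample variance of each variable computed across the whole dataset (that reading, the function name, and the example data are assumptions, not something the card states):

```python
import math
from statistics import variance

def pearson_distance(a, b, variances):
    """Euclidean distance with each squared difference scaled by that variable's variance."""
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))

# Hypothetical dataset: three cases measured on two variables with very different spreads
data = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
variances = [variance(column) for column in zip(*data)]  # one sample variance per variable

d = pearson_distance(data[0], data[1], variances)
```

Dividing by the variance keeps a variable measured on a wide scale from dominating the distance.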

7
Q

Pairing process

A

Depending on the distance measure and linkage method adopted, a process of agglomeration takes place: observations are added to clusters, and the pairing continues until just one cluster remains that contains every case in the dataset.

8
Q

Single linkage

A

This is based on the minimum distance between an observation in one cluster and an observation in another.
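
A minimal sketch, assuming Euclidean distance between cases and clusters stored as lists of coordinate tuples (the function name and the two example clusters are hypothetical):

```python
import math

def single_linkage(cluster_a, cluster_b):
    """Smallest distance between any case in one cluster and any case in the other."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 4.0)]
d = single_linkage(cluster_a, cluster_b)  # nearest pair is (0, 0) and (3, 0)
```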

9
Q

Average linkage

A

This method is based on the average of the distances between every observation in one cluster and every observation in another, rather than on a single nearest or farthest pair.
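
A minimal sketch of the usual definition of average linkage (the mean of all between-cluster pairwise distances), assuming Euclidean distance; the function name and example clusters are hypothetical:

```python
import math

def average_linkage(cluster_a, cluster_b):
    """Mean of the distances over every pair of cases, one from each cluster."""
    distances = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    return sum(distances) / len(distances)

cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 4.0)]
d = average_linkage(cluster_a, cluster_b)  # mean of 3, 5, sqrt(13), sqrt(13)
```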

10
Q

Centroid linkage

A

This uses another averaging technique, linking clusters according to the distance between the cluster means (centroids).
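
A minimal sketch, assuming Euclidean distance between the two cluster means (the function names and example clusters are hypothetical):

```python
import math

def centroid(cluster):
    """Mean of each variable across the cases in a cluster."""
    return tuple(sum(column) / len(cluster) for column in zip(*cluster))

def centroid_linkage(cluster_a, cluster_b):
    """Distance between the two cluster means."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))

cluster_a = [(0.0, 0.0), (0.0, 2.0)]   # centroid (0, 1)
cluster_b = [(3.0, 0.0), (3.0, 4.0)]   # centroid (3, 2)
d = centroid_linkage(cluster_a, cluster_b)  # sqrt(9 + 1)
```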

11
Q

Complete linkage

A

This is based on the maximum distance between an observation in one cluster and an observation in another. The clusters tend to be of relatively similar size and uniformity. This can make outliers more influential, since they pull out the maximum limits of a cluster and skew the result.
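
A minimal sketch, assuming Euclidean distance between cases (the function name and example clusters are hypothetical):

```python
import math

def complete_linkage(cluster_a, cluster_b):
    """Largest distance between any case in one cluster and any case in the other."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)

cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 4.0)]
d = complete_linkage(cluster_a, cluster_b)  # farthest pair is (0, 0) and (3, 4)
```

Because the result depends on the single farthest pair, moving one outlying case changes the distance directly.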

12
Q

Hierarchical clustering

A

Each case starts as a separate cluster, and clusters are combined sequentially until one cluster is left. Once a case has joined a cluster, it does not move.
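
The sequence can be sketched as a short loop. This is an illustrative, unoptimised version that assumes Euclidean distance and a pluggable linkage rule; the function name, the `linkage` parameter, and the example points are hypothetical:

```python
import math

def agglomerate(points, linkage=min):
    """Merge the two closest clusters repeatedly until one cluster remains.

    `linkage` reduces the case-to-case distances between two clusters to one
    number: min gives single linkage, max gives complete linkage.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[i] for i in range(len(points))]  # every case starts alone
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(math.dist(points[p], points[q])
                            for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]  # joined cases never separate again
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]
history = agglomerate(points)
```

With n cases the history always holds n - 1 merges, which is the information a dendrogram draws.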

13
Q

Median linkage

A

Ensures that the median, and not the mean, distance between two clusters provides the distance measure.

14
Q

Method to evaluate the distances between clusters

A

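The most popular approach is Ward’s method, which uses an analysis of variance (ANOVA) approach: at each step it merges the two clusters whose combination gives the smallest increase in the within-cluster sum of squares.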
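
A minimal sketch of the quantity Ward's method compares, assuming the standard formulation in which the cost of a merge is the increase in the within-cluster sum of squares (the function names and example clusters are hypothetical):

```python
def within_ss(cluster):
    """Sum of squared deviations of every score from its cluster mean."""
    n = len(cluster)
    means = [sum(column) / n for column in zip(*cluster)]
    return sum((x - m) ** 2 for case in cluster for x, m in zip(case, means))

def ward_cost(cluster_a, cluster_b):
    """Increase in the within-cluster sum of squares if the two clusters merge."""
    return within_ss(cluster_a + cluster_b) - within_ss(cluster_a) - within_ss(cluster_b)

cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 4.0)]
cost = ward_cost(cluster_a, cluster_b)  # 20 - 2 - 8 = 10.0
```

At each step, the pair of clusters with the lowest cost would be merged.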

15
Q

Non hierarchical clustering

A

The number of clusters is specified in advance, and cases can move between clusters right up until the end of the process.

16
Q

Most popular non hierarchical clustering approach

A

k-means clustering, which aims to produce clusters that are as distinct from one another as possible.
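
A minimal sketch of one common way to run k-means (Lloyd's algorithm), assuming Euclidean distance and caller-supplied starting centres; the function name, parameters, and example data are hypothetical:

```python
import math

def k_means(points, start_centres, max_iter=100):
    """Assign each case to its nearest centre, move each centre to the mean
    of its cases, and repeat until no case changes cluster."""
    centres = list(start_centres)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centres)), key=lambda k: math.dist(p, centres[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # no case moved, so the solution is stable
        labels = new_labels
        for k in range(len(centres)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centre
                centres[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centres

points = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
labels, centres = k_means(points, start_centres=[(0.0, 0.0), (1.0, 0.0)])
```

In this run the second case begins in cluster 1 and moves to cluster 0 on the next pass. This shows how cases can change cluster during the process. The result can depend on the starting centres.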

17
Q

Dendrograms

A

Diagrammatic representation of the pairing process, indicating how many clusters existed at any particular stage of the process. Every observation is represented individually; clusters and individuals not found to be identical at the first stage are therefore represented by vertical lines.

18
Q

Cutting of Dendrograms

A

Cutting is a subjective and somewhat difficult process, as the number of clusters one wants to retain for further analysis depends on the research objectives. The decision may also depend on whether the work is exploratory or confirmatory. Where there is a confirmatory element in the research, the dendrogram might be cut according to the relevant number of clusters sought.
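
A minimal sketch of what a cut does, assuming the merge history is available as (case, case, distance) tuples in which each case stands for the cluster it currently belongs to. That record format, the function name, and the example history are hypothetical:

```python
def cut(merges, n_cases, height):
    """Apply only the merges made below `height`; return one cluster label per case."""
    parent = list(range(n_cases))

    def find(i):
        # Follow parent links up to the representative case of i's cluster
        while parent[i] != i:
            i = parent[i]
        return i

    for a, b, distance in merges:
        if distance < height:
            parent[find(b)] = find(a)

    roots = sorted({find(i) for i in range(n_cases)})
    return [roots.index(find(i)) for i in range(n_cases)]

merges = [(0, 1, 1.0), (2, 3, 1.5), (0, 2, 5.0)]  # hypothetical merge history
two_clusters = cut(merges, n_cases=4, height=3.0)   # [0, 0, 1, 1]
one_cluster = cut(merges, n_cases=4, height=6.0)    # [0, 0, 0, 0]
```

Lowering the cut height keeps more, smaller clusters; the choice of height is the subjective step.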

19
Q

Cautions/Limitations

A

The number of clusters to retain is a subjective choice; no missing data can be permitted; outliers or skewed data can produce problems for clustering; the failure to find groups doesn’t mean that none exist; and the presence of groups doesn’t make them meaningful.