cluster analysis Flashcards
what is cluster analysis
It allows us to simplify a mass of individual cases into fewer groups, or ‘clusters’ based on putting the most similar cases together.
why is cluster analysis useful
we can then explore the characteristics of a cluster and explore the relationships between clusters and variables we might be interested in.
cluster analysis is an analysis of ___
interdependence
what is an individual observation in cluster analysis called
a case
in cluster analysis what do you want to do between clusters and within clusters?
increase similarity within clusters and increase dissimilarity between clusters as much as possible
what is the process of putting certain cases together based on their similarities called
pairing
what are the names of the different statistical distances (how similar or dissimilar cases are) - 3
euclidean distance
manhattan distance
pearson distance
what is euclidean distance?
the square root of the sum of the squared differences between the scores of two observations
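A minimal sketch of the euclidean distance calculation, assuming each case is a plain list of numeric scores. The function name and the two example cases are hypothetical, chosen only for illustration.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between two cases."""
    if len(a) != len(b):
        raise ValueError("both cases need the same number of scores")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# two hypothetical cases measured on two variables
case_a = [1.0, 2.0]
case_b = [4.0, 6.0]

# differences are 3 and 4, so the distance is sqrt(9 + 16) = 5.0
d = euclidean_distance(case_a, case_b)
```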
what is the most commonly used statistical distance measure
euclidean distance
what is manhattan distance
along the corridor and up the stairs: the sum of the absolute differences between the scores of two observations
what is an advantage of manhattan distance
reduces the influence of outliers
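A small sketch of why manhattan distance is less sensitive to outliers than euclidean distance, assuming cases are lists of numeric scores. The cases below are made up: the third variable carries one extreme difference, and the code compares how much of each distance that single variable accounts for.

```python
import math

def manhattan_distance(a, b):
    """Sum of the absolute differences between two cases."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between two cases."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# hypothetical cases: the third variable has an extreme difference of 10
case_a = [0.0, 0.0, 0.0]
case_b = [1.0, 1.0, 10.0]

manhattan = manhattan_distance(case_a, case_b)  # 1 + 1 + 10
euclidean = euclidean_distance(case_a, case_b)  # sqrt(1 + 1 + 100)

# share of each distance that comes from the extreme variable:
# squaring makes the large difference dominate the euclidean sum
share_manhattan = 10.0 / manhattan
share_euclidean = 10.0 ** 2 / euclidean ** 2
```

Here the extreme variable contributes a smaller share of the manhattan total than of the euclidean sum of squares, which is the sense in which its influence is reduced.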
what is pearson distance
the square root of the sum of the squared differences between two observations, with each squared difference divided by the variance of that variable.
what is an advantage of pearson distance
good for observing data that has differences in scale (different magnitude ranges)
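A sketch of pearson distance under one stated assumption: the variance of each variable is estimated as the sample variance across all cases in the data set. The variables (height in cm, income in pounds) and their values are hypothetical, picked so the two variables sit on very different scales.

```python
import math
from statistics import variance

def pearson_distance(a, b, variances):
    """Euclidean distance with each squared difference divided by
    the variance of the corresponding variable."""
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))

# hypothetical cases: [height in cm, income in pounds]
cases = [
    [150.0, 20000.0],
    [160.0, 30000.0],
    [170.0, 40000.0],
]

# sample variance of each variable (column) across all cases
variances = [variance(column) for column in zip(*cases)]

# after scaling, height and income contribute equally (4 + 4),
# even though the raw income difference is 1000 times larger
d = pearson_distance(cases[0], cases[2], variances)
```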
what is single linkage (nearest neighbor)
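the distance between two clusters is based on the shortest distance between any two of their members: take the 2 cases that are closest together in distance, then find the next closest, etc., building up a small number of meaningful clusters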
what is complete linkage (furthest neighbour)
the distance between the two clusters is based on the longest distance between any two members in the two clusters
what is average linkage
The distance between the two clusters is defined as the average distance between all pairs of the two clusters’ members
what is centroid linkage
the centre (centroid) of each cluster is computed first. the distance between two clusters is then the distance between their centres
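The four linkage rules can be compared side by side on the same pair of clusters. This is a sketch that assumes euclidean distance between cases; the function names and the two small clusters are hypothetical.

```python
import math
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(c1, c2):
    """Distances between every member of c1 and every member of c2."""
    return [euclidean(a, b) for a, b in product(c1, c2)]

def single_linkage(c1, c2):
    return min(pairwise(c1, c2))   # nearest pair of members

def complete_linkage(c1, c2):
    return max(pairwise(c1, c2))   # furthest pair of members

def average_linkage(c1, c2):
    distances = pairwise(c1, c2)
    return sum(distances) / len(distances)  # mean over all pairs

def centroid(cluster):
    return [sum(column) / len(cluster) for column in zip(*cluster)]

def centroid_linkage(c1, c2):
    return euclidean(centroid(c1), centroid(c2))  # distance between centres

# two hypothetical clusters lying on a line
cluster_1 = [(0.0, 0.0), (2.0, 0.0)]
cluster_2 = [(5.0, 0.0), (9.0, 0.0)]

single = single_linkage(cluster_1, cluster_2)      # 2 -> 5
complete = complete_linkage(cluster_1, cluster_2)  # 0 -> 9
average = average_linkage(cluster_1, cluster_2)    # mean of 5, 9, 3, 7
centre = centroid_linkage(cluster_1, cluster_2)    # centres at 1 and 7
```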
can a cluster move once it has been joined
no
when are clusters specified in non-hierarchical clustering
specified in advance
within non hierarchical clustering do you know about the factors beforehand?
yes - know and understand the factors
in non-hierarchical clustering, clusters are ___
fluid - can move up until the end of the process
how does hierarchical clustering work
each case starts as a separate cluster, and clusters are combined sequentially until one cluster is left
in hierarchical clustering, clusters are ___
static (once a case is joined it does not move)
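A minimal sketch of hierarchical (agglomerative) clustering, assuming single linkage and euclidean distance; other linkage rules would slot into the same loop. The function names and the four one-variable cases are hypothetical. The list of merges it returns plays the role of a simple agglomeration schedule.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2):
    return min(euclidean(a, b) for a in c1 for b in c2)

def agglomerate(cases):
    """Start with one cluster per case and merge the two closest
    clusters at each step until a single cluster is left."""
    clusters = [[case] for case in cases]
    schedule = []  # one (cluster, cluster, distance) entry per merge
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        schedule.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # merged cases stay together for the rest of the process
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return schedule

# four hypothetical cases measured on one variable
cases = [(0.0,), (1.0,), (5.0,), (6.5,)]
schedule = agglomerate(cases)
merge_distances = [step[2] for step in schedule]
```

A large jump between successive merge distances (here from 1.5 to 4.0) is one informal hint about where to stop merging.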
ward’s method is associated with which form of clustering
hierarchical clustering
k-means clustering is associated with which form of clustering
non-hierarchical
what is ward’s method
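uses an ANOVA (sum of squares) approach to evaluate the distances between clusters, joining the clusters whose merger gives the smallest increase in within-cluster variance. looks for similarity and dissimilarity between cases and clusters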
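One common way to express the ANOVA idea in ward’s method is as a merge cost: the increase in the within-cluster sum of squares that joining two clusters would cause. The sketch below assumes that reading; the function names and the three small clusters are hypothetical.

```python
def centroid(cluster):
    return [sum(column) / len(cluster) for column in zip(*cluster)]

def within_ss(cluster):
    """Sum of squared deviations of each case from the cluster centroid."""
    centre = centroid(cluster)
    return sum((x - m) ** 2 for case in cluster for x, m in zip(case, centre))

def ward_cost(c1, c2):
    """Increase in within-cluster sum of squares if c1 and c2 are merged."""
    return within_ss(c1 + c2) - within_ss(c1) - within_ss(c2)

# three hypothetical clusters measured on one variable
a = [(0.0,), (2.0,)]
b = [(4.0,), (6.0,)]
c = [(20.0,), (22.0,)]

cost_ab = ward_cost(a, b)  # nearby clusters: small increase
cost_ac = ward_cost(a, c)  # distant clusters: large increase
```

Under this rule `a` would be joined with `b` before `c`, because that merge adds the least within-cluster variation.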
what is k-means clustering
an approach that places cases into a number of clusters (k) specified in advance, so that the distinction between clusters is as great as possible
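A bare-bones k-means sketch, assuming euclidean distance and, purely to keep the example deterministic, the first k cases as starting centres (real software usually picks starting centres differently). The function names and the four cases are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    return tuple(sum(column) / len(cluster) for column in zip(*cluster))

def k_means(cases, k, max_iter=100):
    # assumption: the first k cases serve as the starting centres
    centres = [tuple(case) for case in cases[:k]]
    labels = None
    for _ in range(max_iter):
        # assign every case to its nearest centre
        new_labels = [
            min(range(k), key=lambda j: euclidean(case, centres[j]))
            for case in cases
        ]
        if new_labels == labels:  # no case moved, so stop
            break
        labels = new_labels
        # recompute each centre from its current members
        for j in range(k):
            members = [case for case, lab in zip(cases, labels) if lab == j]
            if members:  # keep the old centre if a cluster empties
                centres[j] = centroid(members)
    return labels, centres

# four hypothetical cases forming two obvious groups
cases = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels, centres = k_means(cases, k=2)
```

In this run the second case is first assigned to one cluster and then moves to the other once the centres are recomputed, which illustrates how cases can change cluster until the process ends.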
give an explanation of how ward’s method and k-means clustering can be used in succession
Ward’s method is used first to get a sense of the possible number of clusters, and then k-means clustering is run with the optimum number of clusters to place all the cases in those clusters.
in a dendrogram, a cut that crosses 3 lines is equal to ___
3 clusters
what method of clustering produces a dendrogram
hierarchical
what is an agglomeration schedule
a numerical version of a dendrogram
what are the limitations of cluster analysis (PROAM)
Presence of groups does not mean they are meaningful
how many clusters to Retain
Outliers can produce clustering issues
Absence of groups does not mean they don’t exist
can have no Missing data, as distances need to be calculated for all cases