Cluster Analysis Flashcards
When to look at cluster patterns
a PT would like to group patients according to their attributes in order to treat them better
a PT would like to classify patients based on their individual health records in order to develop specific, appropriate management strategies
Hierarchical clustering
a set of nested clusters organized as a hierarchical tree
produces a set of nested clusters: each pair of individuals or clusters is progressively merged into a larger cluster until only one remains
Non-Hierarchical clustering
groups individuals into clusters so that each object is in exactly one cluster
divides a data set of ‘n’ individuals into ‘m’ clusters
K-means clustering is the most commonly used type
Hierarchical Clustering:
Bottom-up (agglomerative)
starts with each piece of data as its own group and then merges it with others step by step to form larger groups
Hierarchical Clustering: Top down (divisive)
starts with all data in one group and then partitions it step by step using a flat clustering algorithm
Procedure for agglomerative clustering
- assign each item to its own cluster
- find the closest pair of clusters and merge them into a single cluster
- compute distances (similarities) between the new cluster and each of the old clusters
- repeat steps 2 and 3 until all items are merged into a single cluster of the original sample size
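The four steps above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: it assumes Euclidean distance and single linkage (cluster distance = distance of the closest pair of points), and the function names and sample points are made up for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(a, b):
    """Cluster distance under single linkage: the closest pair of points."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points):
    """Merge clusters bottom-up and return the merge history."""
    # Step 1: every item starts in its own cluster.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:  # Step 4: repeat until one cluster remains.
        # Step 2: find the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        d = single_linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Step 3: distances to the new cluster are recomputed on the next pass.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical sample data: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
merges = agglomerate(points)
for left, right, d in merges:
    print(f"merge {left} + {right} at distance {d:.2f}")
```

The merge distances are what a dendrogram plots on its height axis; n items always take n - 1 merges to reach a single cluster.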
Limitations of Hierarchical Clustering
necessary to specify both a distance metric and a linkage criterion without any strong theoretical basis
selecting the number of clusters from the dendrogram may be misleading
K-Means Clustering
data are classified into K clusters.
each individual data point is assigned to the cluster with the nearest mean (centroid)
K-Means Clustering:
Procedure
- select K points as initial centroids
- assign each point to its nearest centroid
- recompute the centroid of each group
- repeat steps 2 and 3 until the solution converges (centroids are stable)
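A minimal sketch of this loop, assuming Euclidean distance and initial centroids supplied by the caller. The helper names, the sample points, and the choice of starting centroids are all hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans(points, centroids, max_iter=100):
    """Run the assign/recompute loop until the centroids stop moving."""
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: dist(p, centroids[k]))
            groups[nearest].append(p)
        # Step 3: recompute each centroid (an empty group keeps its old one).
        new_centroids = [mean(g) if g else c
                         for g, c in zip(groups, centroids)]
        # Step 4: stop once the centroids are stable.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, groups


# Hypothetical data: two well-separated blobs.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
# Step 1: pick K = 2 initial centroids (here simply two of the data points).
centroids, groups = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(groups)
```

The result depends on the starting centroids, so in practice the procedure is usually run from several different starts.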
K-Means Clustering:
Limitations
the researcher must choose the number of clusters
a larger K means shorter distances from points to their centroids, so distance alone cannot choose K
when every data point is its own centroid the distance is 0, but the solution is useless
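A small numeric illustration of this limitation, using hand-picked partitions of made-up 1-D data rather than an actual K-means run. The `wcss` helper (within-cluster sum of squared distances to the cluster mean) is a name introduced only for this sketch.

```python
def wcss(clusters):
    """Within-cluster sum of squared distances to each cluster's mean (1-D)."""
    total = 0.0
    for c in clusters:
        m = sum(c) / len(c)
        total += sum((x - m) ** 2 for x in c)
    return total


# Hypothetical data with two obvious groups.
data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

# Hand-picked partitions for K = 1, K = 2 and K = n.
partitions = {
    1: [data],
    2: [data[:3], data[3:]],
    6: [[x] for x in data],  # every point is its own centroid
}
scores = {k: wcss(p) for k, p in partitions.items()}
for k, score in scores.items():
    print(f"K={k}: within-cluster sum of squares = {score}")
```

The score keeps shrinking as K grows and reaches 0 at K = n, even though the K = 2 solution is the only one that says anything about the data.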
Two Step Clustering
runs pre-clustering first and then applies hierarchical methods.
- can handle categorical AND continuous variables
- automatic selection of number of clusters
- ability to analyze large data set efficiently
Two Step Clustering:
Procedure
- a sequential approach is used to pre-cluster the cases by condensing the variables
- the pre-clusters are then statistically merged into the desired number of clusters
Cluster Quality Validation Index:
Silhouette coefficient
measures how well an individual data point is clustered and estimates the average distance between clusters
Cluster Quality Validation Index:
Silhouette plot
displays a measure of how close each point in one cluster is to points in the neighboring cluster
Interpretation with Silhouette coefficient:
an individual data point with a large Silhouette coefficient (value close to 1)
very well clustered
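For reference, a sketch of how the coefficient for one point can be computed. The cards do not give the formula; this assumes the standard definition s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the points of the nearest other cluster. The function name and data are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette(point, own, others):
    """Silhouette coefficient of one point.

    own    -- the point's cluster (including the point itself)
    others -- list of all other clusters
    Assumes there are no duplicate copies of `point` in `own`.
    """
    rest = [q for q in own if q != point]
    if not rest:
        return 0.0  # common convention for a single-point cluster
    a = sum(dist(point, q) for q in rest) / len(rest)
    b = min(sum(dist(point, q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)


# Hypothetical data: two tight, well-separated clusters.
tight = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
far = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
s = silhouette(tight[0], tight, [far])
print(f"silhouette = {s:.2f}")
```

Here the point is much closer to its own cluster than to the neighboring one, so the value comes out at about 0.93, close to 1.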