Week 13: More advanced Methods - Cluster Analysis Flashcards
What is an exploratory data analysis tool for organizing observed data into meaningful clusters, based on combinations of variables?
Cluster analysis
Example of When to look at grouping (cluster) patterns:
- A PT practitioner would like to group patients according to their attributes in order to treat them better with a personalized care plan
- A PT practitioner would like to classify patients based on their individual health records in order to develop specific management strategies that are appropriate to each patient
Hierarchical clustering -
- a set of nested clusters organized using a hierarchical tree
- the clustering is mapped into a hierarchy basing its grouping on the inter-cluster similarities or dissimilarities
Non-hierarchical clustering -
- a grouping of individuals into non-overlapping subsets (clusters) such that each individual is in exactly one cluster
- divides a dataset of n individuals into m clusters
What is the most commonly used non-hierarchical technique?
K-means clustering
What are 3 types of clustering techniques?
- Hierarchical clustering
- K-means clustering
- Two-step clustering
Bottom-up or agglomerative hierarchical clustering -
starts with each data point as its own cluster and then merges clusters step by step to form larger groups
Top-down or divisive hierarchical clustering -
starts with all the data in one group and then partitions the data step by step using a flat clustering algorithm
Agglomerative hierarchical clustering procedure:
Step 1: Assign each item to a cluster
Step 2: Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one fewer cluster
Step 3: Compute distances (similarities) between the new cluster and each of the old clusters
Step 4: Repeat steps 2 and 3 until all items are merged into a single cluster containing the entire original sample
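The four steps above can be sketched in a few lines of Python. The single-linkage distance, the toy 2-D points, and the stopping count are illustrative assumptions; the procedure as described runs all the way down to a single cluster (target_k = 1).

```python
# A minimal sketch of agglomerative (bottom-up) clustering using
# single linkage on toy 2-D points; data and names are illustrative.
from math import dist

def agglomerate(points, target_k):
    # Step 1: every item starts in its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Step 2: find the closest pair of clusters (single linkage:
        # cluster distance = smallest pairwise point distance)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        # Step 3: merge the pair; Step 4: repeat until target_k remain
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(points, 3))
# → [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```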
3 Limitations of hierarchical clustering:
1. Arbitrary decisions
- it is necessary to specify both the distance metric and the linkage criterion, often without any strong theoretical basis for these decisions
3 Limitations of hierarchical clustering:
2. Data types
- works well only with continuous data; categorical or mixed data types are harder to handle
3 Limitations of hierarchical clustering:
3. Misinterpretation of dendrogram
- selecting the number of clusters from a dendrogram can be misleading
K-means clustering -
a clustering algorithm in which the data are classified into K clusters, mapping each individual data point to the cluster with the nearest mean; this is the most widely used clustering method
Procedure for k-means clustering:
Step 1: Select K points as the initial centroids
Step 2: Assign points to different centroids based on proximity
Step 3: Re-evaluate the centroid of each group
Step 4: Repeat Steps 2 and 3 until the best solution emerges (the centers are stable)
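A minimal Python sketch of these four steps. The toy data, K = 2, and the choice of the first K points as initial centroids are illustrative assumptions (real implementations usually pick initial centroids randomly or with a smarter seeding scheme).

```python
# A minimal sketch of the k-means procedure; data and the
# initialization choice are illustrative assumptions.
from math import dist

def kmeans(points, k, iters=100):
    # Step 1: take the first k points as the initial centroids
    centroids = points[:k]
    for _ in range(iters):
        # Step 2: assign each point to its nearest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Step 3: re-evaluate each centroid as the mean of its group
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
               for i, g in enumerate(groups)]
        # Step 4: stop when the centroids are stable
        if new == centroids:
            break
        centroids = new
    return centroids, groups

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, groups = kmeans(points, 2)
print(groups)
# → [[(1, 1), (1, 2), (2, 1)], [(8, 8), (8, 9), (9, 8)]]
```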
Limitations of k-means clustering:
- K-means is subjective
1. The researcher chooses the number of clusters
2. The more Ks (clusters), the shorter the distance from the centroid
3. As an extreme scenario: when every data point is its own centroid, the distance is zero, but the result is useless
4. What is the optimal K?
What is two-step clustering?
a hybrid approach in which a pre-clustering step is run first, followed by a hierarchical method (hence the name)
What 3 features differentiate two-step clustering from traditional clustering techniques?
- the ability to create clusters based on both categorical and continuous variables
- automatic selection of the number of clusters
- the ability to analyze large data sets efficiently
Procedure of two-step clustering:
Step 1: A sequential approach is used to pre-cluster the cases by condensing the variables (pre-clustering)
Step 2: The pre-clusters are statistically merged into the desired number of clusters (clustering)
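A toy illustration of the pre-cluster/cluster idea in the two steps above. The grid-based condensing step and the single-linkage merge are simplifying assumptions for illustration only; the actual TwoStep algorithm (e.g., in SPSS) builds a CF-tree and chooses the number of clusters with BIC or AIC.

```python
# Illustrative two-stage sketch: condense points into pre-clusters
# (here: coarse grid cells), then merge the pre-clusters. This is
# NOT the real TwoStep algorithm, only the pre-cluster/cluster idea.
from collections import defaultdict
from math import dist

def precluster(points, cell=2.0):
    # Step 1: condense the data by binning points into grid cells;
    # each pre-cluster keeps its centroid and its member points
    cells = defaultdict(list)
    for p in points:
        cells[tuple(round(c / cell) for c in p)].append(p)
    return [(tuple(sum(c) / len(ps) for c in zip(*ps)), ps)
            for ps in cells.values()]

def merge(pre, target_k):
    # Step 2: hierarchically merge pre-clusters until target_k remain
    clusters = [list(ps) for _, ps in pre]
    centers = [c for c, _ in pre]
    while len(clusters) > target_k:
        d, i, j = min((dist(centers[i], centers[j]), i, j)
                      for i in range(len(centers))
                      for j in range(i + 1, len(centers)))
        clusters[i].extend(clusters.pop(j))
        centers[i] = tuple(sum(c) / len(clusters[i]) for c in zip(*clusters[i]))
        centers.pop(j)
    return clusters

points = [(0, 0), (0.5, 0), (5, 5), (5.5, 5), (10, 0)]
print(merge(precluster(points), 3))
# → [[(0, 0), (0.5, 0)], [(5, 5), (5.5, 5)], [(10, 0)]]
```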
What 2 limitations can two-step clustering overcome?
- It can take both continuous and categorical data
- There is no need to enter the number of clusters a priori because it uses indexes of fit (AIC or BIC) to compare each cluster solution and determine which number of clusters is best
Cluster quality validation index:
Silhouette coefficient -
it measures how well an individual data point is clustered and estimates the average distance between clusters
Cluster quality validation index:
Silhouette plot -
it displays a measure of how close each point in one cluster is to points in the neighboring clusters
Interpretation with Silhouette coefficient:
Large Silhouette coefficient value close to 1 -
very well clustered
Interpretation with Silhouette coefficient:
negative Silhouette coefficient value -
probably placed in the wrong cluster
Interpretation with Silhouette coefficient:
small Silhouette coefficient value of around 0 -
lies between two clusters
Silhouette value of 0.5 to 1 =
Good
Silhouette value of 0.2 to 0.5 =
Fair
Silhouette value of -1 to 0.2 =
Poor
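The silhouette coefficient for a point is commonly computed as s = (b - a) / max(a, b), where a is the mean distance from the point to the other members of its own cluster and b is the mean distance to the nearest other cluster. A minimal sketch with illustrative data; it assumes every cluster has at least two points.

```python
# A sketch of the silhouette coefficient s = (b - a) / max(a, b);
# the clusters passed in are illustrative. Assumes each cluster
# has at least two points (a is undefined for singletons).
from math import dist

def silhouette(clusters):
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            # a: mean distance to the other members of p's own cluster
            others = [q for q in cluster if q != p]
            a = sum(dist(p, q) for q in others) / len(others)
            # b: mean distance to the closest neighboring cluster
            b = min(sum(dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    # mean silhouette over all points summarizes cluster quality
    return sum(scores) / len(scores)

tight = [[(0, 0), (0, 1)], [(9, 9), (9, 10)]]
print(round(silhouette(tight), 2))  # → 0.92, i.e. well clustered
```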
What does the application of cluster analysis involve?
grouping similar cases into homogenous groups (called clusters) when the grouping is not previously known
With hierarchical clustering -
the clustering is mapped into a hierarchy basing its grouping on the inter-cluster similarities or dissimilarities
With k-means clustering -
data are classified into K clusters, mapping each individual data point to the cluster with its nearest mean
With two-step clustering -
a sequential approach is first used to pre-cluster the cases, and second the pre-clusters are statistically merged into the desired number of clusters
Why might two-step clustering be a better choice than hierarchical or k-means clustering?
two-step clustering can work with categorical data and is not bound to an arbitrary choice of the number of clusters