Week 13: More Advanced Methods - Cluster Analysis Flashcards
What is an exploratory data analysis tool for organizing observed data into meaningful clusters, based on combinations of variables?
Cluster analysis
Examples of when to look at grouping (cluster) patterns:
- A PT practitioner would like to group patients according to their attributes in order to better treat them with a personalized care plan
- A PT practitioner would like to classify patients based on their individual health records in order to develop specific management strategies appropriate to each patient
Hierarchical clustering -
- a set of nested clusters organized using a hierarchical tree
- the clustering is mapped into a hierarchy, with groupings based on inter-cluster similarities or dissimilarities
Non-hierarchical clustering -
- groups individuals into non-overlapping subsets (clusters) such that each individual is in exactly one cluster
- divides a dataset of n individuals into m clusters
What is the most commonly used non-hierarchical technique?
K-means clustering
What are 3 types of clustering techniques?
- Hierarchical clustering
- K-means clustering
- Two-step clustering
Bottom-up or agglomerative hierarchical clustering -
starts with each data point as its own cluster and then merges clusters step by step to form larger groups
Top-down or divisive hierarchical clustering -
starts with all data points in one cluster and then partitions the data step by step using a flat clustering algorithm
Agglomerative hierarchical clustering procedure:
Step 1: Assign each item to a cluster
Step 2: Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one fewer cluster
Step 3: Compute distances (similarities) between the new cluster and each of the old clusters
Step 4: Repeat Steps 2 and 3 until all items are merged into a single cluster containing the whole original sample
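The procedure above can be run with SciPy's hierarchical clustering routines. The following is a minimal sketch on made-up patient data; the choice of Euclidean distance and Ward linkage is an assumption for illustration, which is exactly the kind of arbitrary decision noted in the limitations below.

```python
# Minimal sketch of agglomerative (bottom-up) clustering with SciPy.
# The data and the distance/linkage choices are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical patient attributes (rows = patients, columns = two variables)
X = np.array([
    [65, 120], [70, 125], [68, 122],   # one loose group
    [30, 200], [32, 205], [28, 198],   # another loose group
])

# Steps 1-4: linkage() starts with each item as its own cluster and
# repeatedly merges the closest pair until a single cluster remains.
Z = linkage(X, method="ward", metric="euclidean")

# Cut the resulting hierarchy into a chosen number of clusters, e.g. 2
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]
```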
3 Limitations of hierarchical clustering:
1. Arbitrary decisions
- necessary to specify both the distance metric and the linkage criteria without any strong theoretical basis for such decisions
3 Limitations of hierarchical clustering:
2. Data types
- works well only with continuous data; categorical or mixed data types are harder to handle
3 Limitations of hierarchical clustering:
3. Misinterpretation of dendrogram
- selecting the number of clusters from the dendrogram can be misleading
K-means clustering -
- a clustering algorithm in which the data are classified into K clusters
- the most widely used clustering method
- each data point is assigned to the cluster with the nearest mean
Procedure for k-means clustering:
Step 1: Select K points as the initial centroids
Step 2: Assign points to different centroids based on proximity
Step 3: Re-evaluate the centroid of each group
Step 4: Repeat Steps 2 and 3 until the best solution emerges (the centroids are stable)
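A minimal sketch of the same four steps using scikit-learn's KMeans; the data and the choice of K = 2 are assumptions made only for illustration.

```python
# Minimal sketch of k-means clustering with scikit-learn.
# The data and K = 2 are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [65, 120], [70, 125], [68, 122],
    [30, 200], [32, 205], [28, 198],
])

# Steps 1-4: initial centroids are chosen, points are assigned to the
# nearest centroid, centroids are re-computed, and the loop repeats
# until the centroids are stable.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster membership of each point
print(km.cluster_centers_)  # final centroid of each cluster
```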
Limitations of k-means clustering:
- K-means is subjective
1. The researcher chooses the number of clusters
2. The larger K (the number of clusters), the shorter the distances from the centroids
3. As an extreme scenario: when every data point is its own centroid, the distance is zero, but such a clustering is useless
4. What is the optimal K? (one common heuristic, the elbow method, is sketched below)
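The elbow method is an addition here, not part of the original notes: plot the within-cluster sum of squares (inertia) against K and pick the point where further increases in K give little improvement. A rough sketch on the same made-up data:

```python
# Rough sketch of the elbow method for choosing K (illustrative data).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [65, 120], [70, 125], [68, 122],
    [30, 200], [32, 205], [28, 198],
])

for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia always shrinks as K grows and reaches 0 when every point
    # is its own centroid, so look for the "elbow", not the minimum.
    print(k, round(km.inertia_, 1))
```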