Week 13: More advanced Methods - Cluster Analysis Flashcards
What is an exploratory data analysis tool for organizing observed data into meaningful clusters, based on combinations of variables?
Cluster analysis
Example of When to look at grouping (cluster) patterns:
- A PT practitioner would like to group patients according to their attributes in order to treat them better with a personalized care plan
- A PT practitioner would like to classify patients based on their individual health records in order to develop specific management strategies that are appropriate to each patient
Hierarchical clustering -
- a set of nested clusters organized using a hierarchical tree
- the clustering is mapped into a hierarchy basing its grouping on the inter-cluster similarities or dissimilarities
Non-hierarchical clustering -
- a grouping of individuals into non-overlapping subsets (clusters) such that each individual is in exactly one cluster
- divides a dataset of n individuals into m clusters
What is the most commonly used non-hierarchical technique?
K-means clustering
What are 3 types of clustering techniques?
- Hierarchical clustering
- K-means clustering
- Two-step clustering
Bottom-up or agglomerative hierarchical clustering -
starts with each data point as its own cluster and then merges clusters step by step to form larger groups
Top-down or divisive hierarchical clustering -
starts with all the data in one group and then partitions the data step by step using a flat clustering algorithm
Agglomerative hierarchical clustering procedure:
Step 1: Assign each item to a cluster
Step 2: Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one fewer cluster
Step 3: Compute distances (similarities) between the new cluster and each of the old clusters
Step 4: Repeat steps 2 and 3 until all items are merged into a single cluster containing the entire original sample
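The four steps above can be sketched in a few lines of Python. The single-linkage distance, the toy 2-D points, and the stopping count are illustrative assumptions; the procedure as described runs all the way down to a single cluster (target_k = 1).

```python
# A minimal sketch of agglomerative (bottom-up) clustering using
# single linkage on toy 2-D points; data and names are illustrative.
from math import dist

def agglomerate(points, target_k):
    # Step 1: every item starts in its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Step 2: find the closest pair of clusters (single linkage:
        # cluster distance = smallest pairwise point distance)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        # Step 3: merge the pair; Step 4: repeat until target_k remain
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(points, 3))
# → [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```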
3 Limitations of hierarchical clustering:
1. Arbitrary decisions
- it is necessary to specify both the distance metric and the linkage criterion, often without any strong theoretical basis for these decisions
3 Limitations of hierarchical clustering:
2. Data types
- works well only with continuous data; categorical or mixed data types are harder to handle
3 Limitations of hierarchical clustering:
3. Misinterpretation of dendrogram
- selecting the number of clusters from a dendrogram can be misleading
K-means clustering -
a clustering algorithm in which the data are classified into K clusters, mapping each individual data point to the cluster with the nearest mean; this is the most widely used clustering method
Procedure for k-means clustering:
Step 1: Select K points as the initial centroids
Step 2: Assign points to different centroids based on proximity
Step 3: Re-evaluate the centroid of each group
Step 4: Repeat Steps 2 and 3 until the best solution emerges (the centers are stable)
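A minimal Python sketch of these four steps. The toy data, K = 2, and the choice of the first K points as initial centroids are illustrative assumptions (real implementations usually pick initial centroids randomly or with a smarter seeding scheme).

```python
# A minimal sketch of the k-means procedure; data and the
# initialization choice are illustrative assumptions.
from math import dist

def kmeans(points, k, iters=100):
    # Step 1: take the first k points as the initial centroids
    centroids = points[:k]
    for _ in range(iters):
        # Step 2: assign each point to its nearest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Step 3: re-evaluate each centroid as the mean of its group
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
               for i, g in enumerate(groups)]
        # Step 4: stop when the centroids are stable
        if new == centroids:
            break
        centroids = new
    return centroids, groups

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, groups = kmeans(points, 2)
print(groups)
# → [[(1, 1), (1, 2), (2, 1)], [(8, 8), (8, 9), (9, 8)]]
```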
Limitations of k-means clustering:
- K-means is subjective
1. The researcher chooses the number of clusters
2. The more Ks (clusters), the shorter the distance from the centroid
3. As an extreme scenario: when every data point is its own centroid, the distance is zero, but the result is useless
4. What is the optimal K?
What is two-step clustering?
a hybrid approach in which a pre-clustering step is run first, followed by a hierarchical method (hence the name)
What 3 features differentiate two-step clustering from traditional clustering techniques?
- the ability to create clusters based on both categorical and continuous variables
- automatic selection of the number of clusters
- the ability to analyze large data sets efficiently
Procedure of two-step clustering:
Step 1: A sequential approach is used to pre-cluster the cases by condensing the variables (pre-clustering)
Step 2: The pre-clusters are statistically merged into the desired number of clusters (clustering)
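A toy illustration of the pre-cluster/cluster idea in the two steps above. The grid-based condensing step and the single-linkage merge are simplifying assumptions for illustration only; the actual TwoStep algorithm (e.g., in SPSS) builds a CF-tree and chooses the number of clusters with BIC or AIC.

```python
# Illustrative two-stage sketch: condense points into pre-clusters
# (here: coarse grid cells), then merge the pre-clusters. This is
# NOT the real TwoStep algorithm, only the pre-cluster/cluster idea.
from collections import defaultdict
from math import dist

def precluster(points, cell=2.0):
    # Step 1: condense the data by binning points into grid cells;
    # each pre-cluster keeps its centroid and its member points
    cells = defaultdict(list)
    for p in points:
        cells[tuple(round(c / cell) for c in p)].append(p)
    return [(tuple(sum(c) / len(ps) for c in zip(*ps)), ps)
            for ps in cells.values()]

def merge(pre, target_k):
    # Step 2: hierarchically merge pre-clusters until target_k remain
    clusters = [list(ps) for _, ps in pre]
    centers = [c for c, _ in pre]
    while len(clusters) > target_k:
        d, i, j = min((dist(centers[i], centers[j]), i, j)
                      for i in range(len(centers))
                      for j in range(i + 1, len(centers)))
        clusters[i].extend(clusters.pop(j))
        centers[i] = tuple(sum(c) / len(clusters[i]) for c in zip(*clusters[i]))
        centers.pop(j)
    return clusters

points = [(0, 0), (0.5, 0), (5, 5), (5.5, 5), (10, 0)]
print(merge(precluster(points), 3))
# → [[(0, 0), (0.5, 0)], [(5, 5), (5.5, 5)], [(10, 0)]]
```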
What 2 limitations can two-step clustering overcome?
- It can take both continuous and categorical data
- There is no need to enter the number of clusters a priori because it uses indexes of fit (AIC or BIC) to compare each cluster solution and determine which number of clusters is best
Cluster quality validation index:
Silhouette coefficient -
it measures how well an individual data point is clustered and estimates the average distance between clusters
Cluster quality validation index:
Silhouette plot -
it displays a measure of how close each point in one cluster is to points in the neighboring clusters
Interpretation with Silhouette coefficient:
Large Silhouette coefficient value close to 1 -
very well clustered
Interpretation with Silhouette coefficient:
negative Silhouette coefficient value -
probably placed in the wrong cluster
Interpretation with Silhouette coefficient:
small Silhouette coefficient value of around 0 -
lies between two clusters
Silhouette value of 0.5 to 1 =
Good
Silhouette value of 0.2 to 0.5 =
Fair
Silhouette value of -1 to 0.2 =
Poor
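The silhouette coefficient for a point is commonly computed as s = (b - a) / max(a, b), where a is the mean distance from the point to the other members of its own cluster and b is the mean distance to the nearest other cluster. A minimal sketch with illustrative data; it assumes every cluster has at least two points.

```python
# A sketch of the silhouette coefficient s = (b - a) / max(a, b);
# the clusters passed in are illustrative. Assumes each cluster
# has at least two points (a is undefined for singletons).
from math import dist

def silhouette(clusters):
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            # a: mean distance to the other members of p's own cluster
            others = [q for q in cluster if q != p]
            a = sum(dist(p, q) for q in others) / len(others)
            # b: mean distance to the closest neighboring cluster
            b = min(sum(dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    # mean silhouette over all points summarizes cluster quality
    return sum(scores) / len(scores)

tight = [[(0, 0), (0, 1)], [(9, 9), (9, 10)]]
print(round(silhouette(tight), 2))  # → 0.92, i.e. well clustered
```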
What does the application of cluster analysis involve?
grouping similar cases into homogenous groups (called clusters) when the grouping is not previously known
With hierarchical clustering -
the clustering is mapped into a hierarchy basing its grouping on the inter-cluster similarities or dissimilarities
With k-means clustering -
data are classified into K clusters, mapping each individual data point to the cluster with its nearest mean
With two-step clustering -
a sequential approach is first used to pre-cluster the cases, and second the pre-clusters are statistically merged into the desired number of clusters
Why might two-step clustering be a better choice than hierarchical or k-means clustering?
two-step clustering can work with categorical data and is not bound to an arbitrary choice of the number of clusters