More Advanced Methods; Cluster Analysis Flashcards
Cluster Analysis is an __________ data analysis tool for organizing observed data into meaningful clusters, based on combinations of variables.
exploratory
When to look at grouping (cluster) patterns?
- A PT practitioner would like to group patients according to their attributes in order to better treat them with a personalized care plan.
- A PT practitioner would like to classify patients based on their individual health records in order to develop specific management strategies that are appropriate to the patients.
What are the 2 types of clusters?
1. Hierarchical Clustering
2. Non-hierarchical Clustering
Hierarchical Clustering:
- A set of nested clusters organized using a hierarchical _____.
- The hierarchical methods produce a set of nested clusters in which each pair of individuals or clusters is progressively nested in a larger cluster until only ____ cluster remains, or all individuals in one group are partitioned step by step.
- tree
- one
Non-hierarchical Clustering:
- A division of individuals into non-overlapping subsets (clusters) such that each object is in exactly __ cluster.
- The non-hierarchical methods divide a dataset of n individuals into m clusters.
- ________ clustering is the most commonly used non-hierarchical technique.
- one
- K-mean
What are the 3 types of clustering techniques?
- Hierarchical Clustering
- K-Mean Clustering
- Two-Step Clustering
__________ clustering is a clustering algorithm where
the clustering is mapped into a hierarchy basing its grouping on the inter-cluster similarities or dissimilarities.
Hierarchical
What are the two “types” of Hierarchical Clustering?
- Bottom-Up (agglomerative)
- Top-Down (divisive)
What is bottom-up clustering?
Starts with each datum as its own cluster and then merges it with others to form larger groups.
What is top-down clustering?
Starts with all data in one group and then partitions the data step by step using a flat clustering algorithm.
What are the steps for bottom-up (agglomerative) clustering?
1. Assign each item to its own cluster.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so there is now one cluster fewer.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of the original sample size.
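The agglomerative steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the sample points, and the choice of single linkage with Euclidean distance are all assumptions made for the example, since the cards do not fix a distance metric or linkage criterion.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of members (assumed linkage)."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, linkage=single_linkage):
    """Merge clusters bottom-up and return the merge history."""
    # Step 1: every item starts in its own cluster.
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        # Steps 2-3: compute the distances between clusters and pick the closest pair.
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        history.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        # Replace the pair with the merged cluster: one cluster fewer.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    # Step 4: the loop ends when a single cluster holds every item.
    return history


# Made-up sample points for illustration only.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
history = agglomerate(points)
for left, right, d in history:
    print(f"merge {left} + {right} at distance {d:.2f}")
```

The merge history is the information a dendrogram draws: each entry records which two clusters were joined and at what distance.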
What are the limitations of Hierarchical Clustering?
- Arbitrary decisions: it is necessary to specify both the distance metric and the linkage criterion without any strong theoretical basis for such decisions.
- Data types: works well only with continuous data.
- Misinterpretation of the dendrogram: selecting the number of clusters from the dendrogram may be misleading.
_______ clustering is a clustering algorithm where data is classified into K clusters. This is the most widely used clustering method. Each data point is assigned to the cluster with the nearest mean.
K-mean
What are the steps for K-Mean Clustering?
1. Select K points as the initial centroids.
2. Assign each point to its nearest centroid.
3. Re-evaluate the centroid of each group.
4. Repeat steps 2 and 3 until the best solution emerges (the centers are stable).
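The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `k_means` and `mean`, the random choice of initial centroids, the Euclidean distance, and the sample points are all made up for the example.

```python
import random
from math import dist


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: select K points as the initial centroids (here: at random).
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        # Step 3: re-evaluate each centroid as the mean of its group
        # (an empty group keeps its previous centroid).
        updated = [mean(g) if g else centroids[c] for c, g in enumerate(groups)]
        # Step 4: stop once the centers are stable.
        if updated == centroids:
            break
        centroids = updated
    return centroids, groups


# Made-up sample points forming two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, groups = k_means(points, k=2)
print(centroids)
print(groups)
```

Different initial centroids can lead to different final clusters on less clearly separated data, which is why the procedure is often run several times.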
What are the limitations of K-Mean Clustering?
- The researcher chooses the number of clusters.
- The larger K (the number of clusters), the shorter the distance from each point to its centroid.
- As an extreme scenario: When every data point is a centroid, the distance is zero.
- What is the optimal K?
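The effect of K on distance can be checked with a small sketch. As a simplifying assumption, it does not run the full algorithm: it uses the first K points as centroids, so each larger K only adds a centroid to the previous set. The function name and sample points are hypothetical.

```python
from math import dist


def total_distance(points, centroids):
    """Sum of each point's distance to its nearest centroid."""
    return sum(min(dist(p, c) for c in centroids) for p in points)


# Made-up sample points for illustration only.
points = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]

# Assumed shortcut: the first K points act as the centroids.
totals = [total_distance(points, points[:k]) for k in range(1, len(points) + 1)]
for k, t in enumerate(totals, start=1):
    print(f"K={k}: total distance {t:.2f}")
```

The total never increases as K grows and reaches zero when every point is a centroid, so a small distance alone cannot identify the optimal K.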
__________ clustering is a hybrid approach where we run pre-clustering first and then run hierarchical methods.
Two-Step
What are the features that differentiate Two-Step Clustering from traditional clustering techniques?
- The ability to create clusters based on both categorical and continuous variables.
- Automatic selection of the number of clusters.
- The ability to analyze large data sets efficiently.
What are the steps for Two-Step Clustering?
1. A sequential approach is used to pre-cluster the cases by condensing the variables (pre-clustering).
2. The pre-clusters are statistically merged into the desired number of clusters (clustering).
What are the advantages of Two-Step Clustering?
- It can take both continuous and categorical data.
- There is no need to enter the number of clusters a priori because it uses indexes of fit (AIC or BIC) to compare each cluster solution and determine which number of clusters is best.
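The idea of comparing cluster solutions by AIC or BIC can be sketched with the standard textbook definitions of the two indexes (lower is better). The log-likelihoods and parameter counts below are made-up numbers for illustration only, and a statistical package's two-step procedure may compute these quantities differently.

```python
from math import log


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * log(n_obs) - 2 * log_likelihood


# Hypothetical fits: number of clusters -> (log-likelihood, number of parameters).
candidates = {2: (-520.0, 4), 3: (-480.0, 6), 4: (-478.0, 8)}
n_obs = 200  # hypothetical sample size

scores = {k: bic(ll, p, n_obs) for k, (ll, p) in candidates.items()}
best_k = min(scores, key=scores.get)  # the solution with the lowest BIC
print(best_k)
```

Both indexes reward fit and penalize extra parameters, so adding a cluster is only preferred when it improves the fit enough to offset the penalty.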
Cluster Quality Validation Index:
- ______________ measures how well an individual data point is clustered and estimates the average distance between clusters.
- ______________ displays a measure of how close each point in one cluster is to points in the neighboring cluster.
- Data with a large silhouette coefficient value of almost 1 means what?
- Data with a small silhouette coefficient value of almost 0 means what?
- Data with a negative silhouette coefficient means what?
- silhouette coefficient
- silhouette plot
- very well clustered
- lies between 2 clusters
- probably placed in wrong cluster
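For a single data point, the silhouette coefficient is commonly computed as (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. A minimal sketch, assuming Euclidean distance and made-up sample clusters (the function name is hypothetical):

```python
from math import dist


def silhouette(point, own, others):
    """Silhouette coefficient of `point`, a member of cluster `own`."""
    mates = list(own)
    mates.remove(point)  # the point is not compared with itself
    if not mates:
        return 0.0  # common convention for a cluster of one point
    # a: mean distance to the other members of the point's own cluster.
    a = sum(dist(point, q) for q in mates) / len(mates)
    # b: mean distance to the members of the nearest other cluster.
    b = min(sum(dist(point, q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)


# Made-up clusters for illustration only.
low = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
high = [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

well_placed = silhouette((1.0, 1.0), low, [high])
# Deliberately put (8.0, 8.0) in the wrong cluster.
misplaced = silhouette((8.0, 8.0), low + [(8.0, 8.0)], [high[1:]])
print(round(well_placed, 2), round(misplaced, 2))
```

The well-placed point scores close to 1, while the deliberately misplaced point scores below 0.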
Silhouette Values:
- 0.5 to 1.0 = _____
- 0.2 to 0.5 = _____
- -1.0 to 0.2 = ______
- Good
- Fair
- Poor
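The ranges above can be written as a small lookup. The cards do not say which label the boundary values 0.5 and 0.2 receive, so assigning them to the higher label is an assumption of this sketch, as is the function name.

```python
def silhouette_label(value):
    """Map a silhouette value to the Good / Fair / Poor ranges."""
    if not -1.0 <= value <= 1.0:
        raise ValueError("silhouette values lie between -1.0 and 1.0")
    if value >= 0.5:  # assumption: 0.5 itself counts as Good
        return "Good"
    if value >= 0.2:  # assumption: 0.2 itself counts as Fair
        return "Fair"
    return "Poor"


for value in (0.9, 0.3, -0.4):
    print(value, silhouette_label(value))
```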
Summary for More Advanced Methods; Cluster Analysis:
- The application of ___________ involves grouping similar cases into homogeneous groups (called clusters) when the grouping is not previously known.
- With ________ clustering, the clustering is mapped into a hierarchy basing its grouping on the inter-cluster similarities or dissimilarities.
- With _______ clustering, data is classified into K clusters, with each data point assigned to the cluster with the nearest mean.
- With ________ clustering, a sequential approach is first used to pre-cluster the cases, and second the pre-clusters are statistically merged into the desired number of clusters.
- Two-step clustering may be a better choice than hierarchical or k-mean clustering because it can work with _________ data and is not bound to an arbitrary choice of the number of clusters.
- cluster analysis
- hierarchical
- K-mean
- Two-step
- categorical