L9 Cluster Analysis Flashcards
What is the main difference between factor and cluster analysis?
Factor analysis groups items (variables) into factors.
Cluster analysis groups objects (cases) into clusters.
What is the goal of the cluster analysis?
Find clusters such that within a cluster the objects are as similar as possible (internal homogeneity) while at the same time the clusters are as distinct as possible (external heterogeneity)
Two ways of quantifying similarity
Distance and correlation
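The two notions of similarity can disagree. A minimal sketch, assuming two made-up profiles (`low` and `high` are hypothetical) that have the same shape but sit at different levels:

```python
import math

def euclidean(p, q):
    """Distance-based similarity: small distance means similar level."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pearson(p, q):
    """Correlation-based similarity: high correlation means similar pattern."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    var_p = sum((a - mp) ** 2 for a in p)
    var_q = sum((b - mq) ** 2 for b in q)
    return cov / math.sqrt(var_p * var_q)

# Hypothetical profiles: same pattern, shifted by 10 on every variable.
low = (1.0, 2.0, 3.0)
high = (11.0, 12.0, 13.0)

r = pearson(low, high)    # perfectly correlated pattern
d = euclidean(low, high)  # yet far apart in distance
```

By correlation the two objects are as similar as possible, by distance they are far apart, so the choice depends on whether pattern or level matters.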
Two types of distance measure
Euclidean & city-block distance
–> Which one to use depends on the case, i.e. on what you define as similar.
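A minimal sketch of the two distance measures, assuming two hypothetical objects `p` and `q` measured on two variables:

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def city_block(p, q):
    """City-block (Manhattan) distance: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical objects for illustration only.
p = (0.0, 0.0)
q = (3.0, 4.0)

d_euclid = euclidean(p, q)  # 5.0
d_block = city_block(p, q)  # 7.0
```

The city-block distance is never smaller than the Euclidean distance, because it adds up the differences on each variable instead of taking the direct path.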
What is agglomerative hierarchical clustering?
You start with every object as its own cluster and keep merging, so you get fewer and larger clusters
What is divisive hierarchical clustering?
You start with one large cluster and keep splitting it, so you get more and smaller clusters
What is the procedure for deriving clusters in the agglomerative hierarchical approach?
Starting point: calculate pairwise similarity between objects (based on distance or correlation)
Step 1: Merge the two objects with the highest similarity (P and Q) into a cluster
Step 2: Calculate the linkage criterion between the new cluster and the other objects (or clusters)
Step 3: Merge the objects or clusters that minimize the linkage criterion
Then repeat steps 2 and 3 until all objects are in a single cluster
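The procedure can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, single linkage as the linkage criterion, and four made-up points; the names `agglomerate` and `single_linkage` are hypothetical.

```python
import math
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2, points):
    """Linkage criterion (assumed here): distance between the closest members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points, linkage=single_linkage):
    """Merge clusters until one remains; return the list of merges."""
    # Starting point: every object is its own cluster (tuple of indices).
    clusters = [(i,) for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # Evaluate the linkage criterion for every pair of clusters
        # and pick the pair that minimizes it.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: linkage(pair[0], pair[1], points))
        height = linkage(a, b, points)
        # Replace the two clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        history.append((a, b, height))
    return history

# Hypothetical data: two tight pairs that are far from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 3.0)]
history = agglomerate(points)
for a, b, height in history:
    print(a, "+", b, "merged at", height)
```

For singletons the linkage criterion is just the pairwise distance, so the first pass through the loop is Step 1 and every later pass is Steps 2 and 3.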
What does the coefficient measure in the graph?
It is a heterogeneity index. The larger the coefficient, the greater the heterogeneity.
What are three linkage methods?
Single linkage
Complete linkage
Ward’s method
What is single linkage?
The distance between two clusters is the distance between their two closest members (nearest neighbor)
What is complete linkage?
The distance between two clusters is the distance between their two most distant members (farthest neighbor)
What is Ward’s method?
Merge the clusters whose combination gives the smallest total within-cluster distance (variance).
Most reliable method.
You create a centroid, which is the mean of the hypothetical merged cluster
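The three linkage criteria can be compared on the same pair of clusters. A minimal sketch, assuming Euclidean distance and two made-up clusters `A` and `B`; Ward's criterion is written here as the increase in within-cluster sum of squares caused by the merge, which is the usual formulation.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(A, B):
    """Nearest neighbor: smallest distance between members of A and B."""
    return min(euclidean(p, q) for p in A for q in B)

def complete_link(A, B):
    """Farthest neighbor: largest distance between members of A and B."""
    return max(euclidean(p, q) for p in A for q in B)

def centroid(C):
    """Mean of the cluster on every variable."""
    return tuple(sum(col) / len(C) for col in zip(*C))

def wss(C):
    """Sum of squared distances of the members to the cluster centroid."""
    m = centroid(C)
    return sum(sum((x - mu) ** 2 for x, mu in zip(p, m)) for p in C)

def ward_increase(A, B):
    """Ward's criterion: how much WSS grows if A and B were merged."""
    return wss(A + B) - wss(A) - wss(B)

# Hypothetical clusters for illustration only.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(4.0, 0.0), (4.0, 2.0)]

s = single_link(A, B)    # 4.0
c = complete_link(A, B)  # sqrt(20)
w = ward_increase(A, B)  # 16.0
```

Single and complete linkage look only at one pair of objects, while Ward's criterion uses the centroid of the hypothetical merged cluster and therefore all of its members.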
What are 3 indices for model evaluation? (About how many clusters to retain)
- Within-cluster sum of squares (WSS)
- Information criteria:
  - Bayesian information criterion (BIC)
  - Akaike information criterion (AIC)
How can you minimize within cluster sum of squares?
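By using more clusters: WSS is always smallest with the maximum number of clusters available (every object in its own cluster gives WSS = 0), so WSS alone cannot tell you how many clusters to retain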
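A small numeric sketch of why WSS keeps shrinking. It assumes four made-up one-dimensional objects and a hypothetical nested sequence of partitions, as a hierarchical method might produce:

```python
def wss(cluster):
    """Sum of squared deviations of the values from the cluster mean."""
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster)

def total_wss(partition):
    """Within-cluster sum of squares summed over all clusters."""
    return sum(wss(cluster) for cluster in partition)

# Hypothetical partitions of the objects 0, 1, 5, 8 into k clusters.
partitions = {
    1: [[0.0, 1.0, 5.0, 8.0]],
    2: [[0.0, 1.0], [5.0, 8.0]],
    3: [[0.0, 1.0], [5.0], [8.0]],
    4: [[0.0], [1.0], [5.0], [8.0]],
}

wss_by_k = {k: total_wss(p) for k, p in partitions.items()}
# {1: 41.0, 2: 5.0, 3: 0.5, 4: 0.0}
```

WSS drops with every extra cluster and reaches 0 when each object is alone. In this example the large drop from one to two clusters and the small drops afterwards are what you would look at instead of the minimum.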
What is the advantage of the Information criteria over WSS?
The BIC and AIC trade off model fit against model complexity: extra clusters are penalized, so unlike WSS they do not automatically favor the largest number of clusters
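The trade-off can be sketched with the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of estimated parameters, n the number of objects and ln L the log-likelihood. This assumes a model-based clustering that yields a log-likelihood; the log-likelihoods, parameter counts and n below are invented for illustration, not results.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion: fit plus a penalty of 2 per parameter."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: the penalty grows with ln(n_obs)."""
    return n_params * math.log(n_obs) - 2 * log_lik

n_obs = 100  # hypothetical number of objects

# Hypothetical candidates: clusters -> (log-likelihood, number of parameters).
candidates = {
    2: (-250.0, 4),
    3: (-240.0, 7),
    4: (-239.0, 10),
}

aic_by_k = {k: aic(ll, p) for k, (ll, p) in candidates.items()}
bic_by_k = {k: bic(ll, p, n_obs) for k, (ll, p) in candidates.items()}

# Lower is better for both criteria.
best_by_aic = min(aic_by_k, key=aic_by_k.get)
best_by_bic = min(bic_by_k, key=bic_by_k.get)
```

In this made-up example the fourth cluster improves the fit only slightly, so the penalty outweighs the gain and both criteria prefer three clusters.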