Cluster Analysis Flashcards
Multivariate Statistics
Cluster analysis = a multivariate statistical method
2 uses of multivariate statistics: dimensionality reduction and uncovering structure (unsupervised learning)
EFA Summary
Finds structure in a set of variables by processing correlations between them
Extracts factors to best represent the variables’ interrelations
Measurements provide a dimensional model
Cluster Analysis
Finds structure in data
Looks for categories
Uses information about similarity between objects rather than correlations between variables
Classifies objects into categories based on their similarities
Grouping Data into Clusters
Given a set of points and a notion of distance between points, creates clusters
Members of the same cluster are similar to each other; members of different clusters are dissimilar
Points in high-dimensional space
Similarity defined using a distance measure
Euclidean Distance
Length of the straight line between two points: the square root of the sum of squared differences along each dimension (e.g. x and y)
Very sensitive to outliers, because differences are squared
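Worked example (a minimal sketch; the function name euclidean and the sample points are made up for illustration). It computes the distance for a classic 3-4-5 triangle, then shows how a single extreme coordinate dominates the result:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points with the same number of dimensions."""
    # Sum the squared differences along each dimension, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d_normal = euclidean((0, 0), (3, 4))    # differences 3 and 4 -> sqrt(9 + 16) = 5.0
d_outlier = euclidean((0, 0), (3, 40))  # one extreme y value dominates the distance

print(d_normal, d_outlier)
```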
Hierarchical Clustering - Agglomerative
Bottom up
Initially each datapoint is a cluster and clusters are recursively combined
Combines the two nearest clusters into a larger one and repeats
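Worked example (a rough sketch, not a reference implementation). It assumes the distance between two clusters is the Euclidean distance between their centroids and that merging stops at a chosen number of clusters k; both are illustrative choices, and the names agglomerate and centroid are hypothetical:

```python
import math
from itertools import combinations

def centroid(cluster):
    """Mean point of a cluster (average of each coordinate)."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def agglomerate(points, k):
    """Bottom-up clustering: start with one cluster per point, merge until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters whose centroids are closest together.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                     centroid(clusters[ij[1]])),
        )
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with the combined one.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
result = agglomerate(points, k=2)
print(result)
```

Recording the order of the merges is what a dendrogram would draw.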
Hierarchical Clustering - Divisive
Top down
Initially all datapoints form a single cluster, which is recursively split into smaller ones
Assumes all points belong to one big cluster, then breaks it down into smaller clusters
Point Assignments
K-means
Maintains a set of k clusters
Points belong to ‘nearest’ cluster
Dendrogram
Tree-like diagram showing how points are successively merged (or split) into clusters
Euclidean Space
When to stop combining clusters:
- pick a number (k) upfront and stop when the number of clusters = k
- stop when the next merge would create a cluster with low cohesion
Cohesion - measured by cluster diameter or radius from the centroid
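Worked example (a sketch under assumptions: diameter is taken as the largest distance between any two points in a cluster, and radius as the largest distance from the centroid to a point; other definitions exist, and the function names are hypothetical). A merge that pulls in a distant point sharply increases both measures, which is the signal to stop:

```python
import math
from itertools import combinations

def centroid(cluster):
    """Mean point of a cluster (average of each coordinate)."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def diameter(cluster):
    """Largest distance between any two points in the cluster."""
    return max((math.dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)

def radius(cluster):
    """Largest distance from the centroid to any point in the cluster."""
    c = centroid(cluster)
    return max(math.dist(p, c) for p in cluster)

tight = [(1, 1), (1, 2), (2, 1)]
loose = tight + [(8, 8)]  # the same cluster after merging in a distant point

print(diameter(tight), radius(tight))
print(diameter(loose), radius(loose))
```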
K-Means Clustering
Point-assignment method
Preferable for very large datasets
Assumes Euclidean space
Things become more manageable
Method of k-means Clustering
Selects k initial centroids (e.g. by picking k of the points)
Creates clusters by assigning each point to the cluster whose centroid it is closest to
Centroids are selected first, then each data point is assigned to its nearest centroid
Centroids are recalculated as the average (mean) of the points assigned to them
Data points are reassigned until assignments stop changing or the maximum number of iterations is reached
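Worked example (a minimal sketch of the loop described above). It assumes the first k points serve as initial centroids and a default of 100 iterations; both are illustrative choices, and the names k_means and mean_point are hypothetical:

```python
import math

def mean_point(points):
    """Average of each coordinate across a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iter=100):
    # Assumption: use the first k points as the initial centroids.
    centroids = list(points[:k])
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, so the clusters are stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean_point(members)
    return centroids, assignment

points = [(1, 1), (8, 8), (1, 2), (9, 8), (2, 1), (8, 9)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Different starting centroids can give different final clusters, so in practice the procedure is often run several times.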
Limitations of Clustering
Number of clusters must be chosen
Objects to cluster - sample should be representative and random
Only include variables with good reason
Require interpretation
Validation of classification
Define number of clusters after research
Choose objects carefully
Meaningless data = meaningless clusters
Validate using other measures: check whether the clusters differ significantly on measures not used to form them