Cluster Analysis Flashcards
Cluster Analysis is …
Grouping observations by their key characteristics so that observations in the same cluster are similar to each other and different from observations in other clusters; the aim is to identify natural groups in the data and analyze groups instead of individual observations (data reduction)
Assumptions
Representativeness of the sample, no substantial multicollinearity, no outliers
To limit multicollinearity …
Scale (standardize) the variables, use a distance measure that accounts for correlation (e.g. Mahalanobis), or exclude highly correlated variables
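The scaling step can be sketched as a z-score transformation, so that every variable contributes on a comparable scale. This is a minimal sketch using only the standard library; the function name `standardize`, the example data, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Return rows with every variable rescaled to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Assumption: population standard deviation. A constant column
    # (standard deviation 0) would need special handling.
    stds = [pstdev(col) for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, stds))
        for row in rows
    ]

# Hypothetical data: (age in years, income in currency units).
data = [(25, 30000), (35, 52000), (45, 61000), (55, 48000)]
scaled = standardize(data)
```

Without this step, the income column would dominate any Euclidean distance simply because its values are larger.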
A similarity can be measured by …
Distance measures (the Minkowski family, e.g. Euclidean; Mahalanobis) or correlation coefficients
A distance measure quantifies …
The dissimilarity between two objects; the larger the value, the less similar the objects
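As an illustration, the Minkowski distance of order p between two objects can be sketched as below; p = 1 gives the city-block (Manhattan) distance and p = 2 the Euclidean distance. The function name `minkowski` and the two example points are made up for this sketch.

```python
def minkowski(a, b, p=2):
    """Minkowski distance of order p between two points with the same variables."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of variables")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# Made-up observations for illustration only.
a = (0.0, 0.0)
b = (3.0, 4.0)

manhattan = minkowski(a, b, p=1)  # sum of absolute differences
euclidean = minkowski(a, b, p=2)  # straight-line distance
```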
Hierarchical cluster technique means …
The number of clusters is not fixed in advance; clusters are formed stepwise, either agglomeratively or divisively
Agglomerative clustering means …
Starts with every object in its own cluster and repeatedly merges the most similar clusters
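A minimal sketch of the agglomerative procedure, assuming Euclidean distance and the single linkage rule (distance between the closest members) to decide which clusters to merge. The names `single_link` and `agglomerate` and the sample points are hypothetical; the merge history it returns is the information a dendrogram would display.

```python
import math

def single_link(c1, c2):
    """Distance between two clusters: their two closest members."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points):
    """Merge clusters step by step; return the history of merges."""
    clusters = [[p] for p in points]  # every object starts in its own cluster
    merges = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Pick the pair of clusters that is currently closest.
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        distance = single_link(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], distance))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return merges

# Made-up points for illustration only.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
merges = agglomerate(points)
```

Cutting the merge history at a chosen distance gives the final clusters, which is why the number of clusters does not have to be set beforehand.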
Divisive clustering means …
Starts with one cluster containing all objects and repeatedly splits it, ending with every object in its own cluster
Single linkage method is …
Defines the distance between clusters by their two closest members; good for detecting outliers
Complete linkage method is …
Defines the distance between clusters by their two most distant members; sensitive to outliers
Average linkage method is …
Considers the average similarity between all individuals in one cluster and all individuals in the other
Centroid linkage method is …
Considers the distance between the cluster centroids
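The single, complete, average, and centroid linkage rules differ only in how they turn object-to-object distances into a cluster-to-cluster distance. The sketch below compares them on two made-up clusters, assuming Euclidean distance; all function names are hypothetical.

```python
import math
from itertools import product

def pairwise(c1, c2):
    """All distances between a member of c1 and a member of c2."""
    return [math.dist(p, q) for p, q in product(c1, c2)]

def single_linkage(c1, c2):
    return min(pairwise(c1, c2))  # closest pair

def complete_linkage(c1, c2):
    return max(pairwise(c1, c2))  # most distant pair

def average_linkage(c1, c2):
    distances = pairwise(c1, c2)
    return sum(distances) / len(distances)  # mean over all pairs

def centroid(cluster):
    """Mean of each variable over the cluster's members."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def centroid_linkage(c1, c2):
    return math.dist(centroid(c1), centroid(c2))

# Made-up clusters for illustration only.
left = [(0, 0), (0, 2)]
right = [(4, 0), (4, 2)]

results = {
    "single": single_linkage(left, right),
    "complete": complete_linkage(left, right),
    "average": average_linkage(left, right),
    "centroid": centroid_linkage(left, right),
}
```

Because the complete linkage rule takes the maximum, a single extreme observation can change its value substantially, which is one way to see its sensitivity to outliers.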
Ward’s method …
Merges the clusters whose union gives the smallest increase in within-cluster variance; works well when equally sized clusters are expected; sensitive to outliers
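One way to make the within-cluster variance criterion concrete is to compute how much the within-cluster sum of squared deviations grows when two clusters are merged. This is a sketch under that assumption, with hypothetical function names and made-up data.

```python
def centroid(cluster):
    """Mean of each variable over the cluster's members."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def within_ss(cluster):
    """Sum of squared deviations of the members from the cluster centroid."""
    center = centroid(cluster)
    return sum(
        sum((x - m) ** 2 for x, m in zip(point, center))
        for point in cluster
    )

def ward_increase(c1, c2):
    """Growth in within-cluster sum of squares caused by merging c1 and c2."""
    return within_ss(c1 + c2) - within_ss(c1) - within_ss(c2)

# Made-up clusters for illustration only.
left = [(0, 0), (0, 2)]
right = [(4, 0), (4, 2)]
increase = ward_increase(left, right)
```

At each step, the pair of clusters with the smallest such increase would be the one to merge.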
Seed points are for …
Creating clusters around them when the number of clusters is fixed in advance (non-hierarchical clustering)
k-means clustering …
Calculates the distance between each object and the seeds, assigns each object to its nearest seed, then recomputes the cluster centers and repeats until the assignments stop changing
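The assignment idea can be sketched as follows, assuming Euclidean distance and the common iterative variant in which the centers are recomputed after each assignment pass. The function names, the `max_iter` parameter, the seeds, and the data are all hypothetical choices for this example.

```python
import math

def centroid(cluster):
    """Mean of each variable over the cluster's members."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def k_means(points, seeds, max_iter=100):
    """Assign points to the nearest center, update centers, repeat until stable."""
    centers = list(seeds)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            # Index of the center closest to this point.
            nearest = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            centroid(c) if c else centers[k]
            for k, c in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing moved, so assignments are stable
            break
        centers = new_centers
    return centers, clusters

# Made-up points and seed points for illustration only.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
seeds = [(0, 0), (10, 10)]
centers, clusters = k_means(points, seeds)
```

The result depends on the chosen seed points, so different seeds can lead to different final clusters.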