5: PCA and Cluster Analysis Flashcards
What is the advantage of a dendrogram?
A single dendrogram shows the clusterings obtained for every possible number of clusters, from 1 to n, so we can evaluate all of them without re-running the algorithm.
Which two clustering methods are the most common?
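As a rough illustration of why one tree is enough, the sketch below runs a single bottom-up pass and records the partition at every level. The function name `partitions_from_one_tree`, the choice of complete linkage, and the toy points are assumptions made for this example only.

```python
from itertools import combinations
from math import dist


def partitions_from_one_tree(points):
    """Run one agglomerative pass (complete linkage, an assumed choice)
    and return {number_of_clusters: list of clusters of point indices}."""
    clusters = [frozenset([i]) for i in range(len(points))]
    levels = {len(clusters): list(clusters)}

    def dissimilarity(a, b):
        # Complete linkage: largest pairwise distance between the clusters.
        return max(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Merge the two clusters that are currently least dissimilar.
        a, b = min(combinations(clusters, 2), key=lambda pair: dissimilarity(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        levels[len(clusters)] = list(clusters)
    return levels


# Hypothetical toy data: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (20.0, 20.0)]
levels = partitions_from_one_tree(points)
for k in sorted(levels):
    print(k, [sorted(c) for c in levels[k]])
```

Each key of `levels` corresponds to cutting the same dendrogram at a different height, so every cluster count from n down to 1 comes out of one run.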
Hierarchical and K-means clustering.
What is the idea of K-means clustering?
The idea is to partition the observations into a pre-specified number of clusters (k) by assigning each data point to its nearest centroid. The algorithm aims to minimize the sum of squared distances between the data points and their respective centroids.
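The assign-then-update loop can be sketched in plain Python as follows. This is a minimal sketch, not a reference implementation: the helper names are made up for the example, and the starting centroids are passed in by the caller (an assumption made to keep the run deterministic; in practice they are usually chosen at random and the algorithm is restarted several times).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, initial_centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid, until the assignment stops changing."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no observation changed cluster
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = centroid(members)
    return labels, centroids


# Hypothetical toy data with two visible groups; k = 2.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids = k_means(points, initial_centroids=[points[0], points[3]])

# The quantity being minimized: sum of squared distances to the own centroid.
within_ss = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
print(labels, within_ss)
```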
What are two major disadvantages of K-means clustering?
1) The algorithm will force all observations into a cluster, regardless of how “far” an observation is from the other observations, 2) we need to pre-specify the number of clusters.
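What is the difference between agglomerative and divisive clustering?
Agglomerative clustering works bottom-up: each observation starts as its own cluster, and the closest clusters are merged step by step. Divisive clustering works top-down: all observations start in one cluster, which is split step by step.
Can agglomerative and divisive clustering be used for both hierarchical and K-means?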
No, divisive and agglomerative clustering are two distinct approaches to hierarchical clustering. K-means does not fit into either category.
Which two properties need to be fulfilled in K-means?
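1) Each observation needs to belong to at least one cluster, 2) clusters must be non-overlapping: no observation belongs to more than one cluster. Together, this means each observation belongs to exactly one cluster.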
What is the objective in K-means?
To partition the observations into K clusters such that the total within-cluster variation, summed over all K clusters, is as small as possible.
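Written as a formula, and assuming squared Euclidean distance as the measure of within-cluster variation (a common choice, not stated on the card), the objective looks roughly like this. Here C_k is the set of observations in cluster k, |C_k| is its size, and p is the number of variables; these symbols are introduced for the illustration.

```latex
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} W(C_k),
\qquad
W(C_k) = \frac{1}{|C_k|} \sum_{i,\,i' \in C_k} \sum_{j=1}^{p} \left( x_{ij} - x_{i'j} \right)^2
```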
Is cluster analysis a supervised or unsupervised learning method?
Unsupervised (it has no class label or quantitative response variable).
What is common for all linkage methods (single, complete, average)?
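At every step, the two clusters with the smallest inter-cluster dissimilarity are merged. The linkage methods differ only in how that dissimilarity is measured: single linkage uses the smallest pairwise distance between the two clusters, complete linkage the largest, and average linkage the mean.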
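A small sketch can make the shared merge rule concrete: the loop is identical for all three linkages, and only the way the distance between two clusters is computed changes. The function names and the one-dimensional toy points are assumptions for this example.

```python
from itertools import combinations
from math import dist


def cluster_dissimilarity(a, b, linkage):
    """Dissimilarity between clusters a and b (lists of points)."""
    pairwise = [dist(p, q) for p in a for q in b]
    if linkage == "single":
        return min(pairwise)                  # closest pair
    if linkage == "complete":
        return max(pairwise)                  # farthest pair
    if linkage == "average":
        return sum(pairwise) / len(pairwise)  # mean over all pairs
    raise ValueError(f"unknown linkage: {linkage}")


def merge_heights(points, linkage):
    """Repeatedly merge the two least dissimilar clusters and return
    the dissimilarity at which each merge happened."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_dissimilarity(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        heights.append(cluster_dissimilarity(clusters[i], clusters[j], linkage))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return heights


# Hypothetical one-dimensional toy data.
points = [(0.0,), (1.0,), (3.0,)]
single = merge_heights(points, "single")
complete = merge_heights(points, "complete")
average = merge_heights(points, "average")
print(single, complete, average)
```

In all three runs the first merge joins the points at 0 and 1. The second merge happens at a different height in each run, because the distance from the merged cluster to the point at 3 depends on the linkage.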
Why is it important to standardize the variables before performing cluster analysis?
Standardization prevents variables with larger scales from dominating how clusters are defined. It allows all variables to be considered by the algorithm with equal importance.
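The effect can be seen on a small made-up data set of (age in years, income in dollars). The sketch below standardizes each variable to mean 0 and standard deviation 1. The data, the function name `standardize`, and the use of the population standard deviation are assumptions for illustration.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, sds))
        for row in rows
    ]


# Hypothetical observations: (age in years, income in dollars).
a, b, c, d = (25, 50_000), (60, 50_500), (26, 52_000), (61, 52_500)
rows = [a, b, c, d]
za, zb, zc, zd = standardize(rows)

# Raw distances are driven almost entirely by income, the large-scale variable.
raw_ab, raw_ac = dist(a, b), dist(a, c)
# After standardization, the 35-year age gap between a and b counts as well.
std_ab, std_ac = dist(za, zb), dist(za, zc)
print(raw_ab < raw_ac, std_ab < std_ac)
```

On the raw scale, `a` looks closer to `b` (similar income, very different age) than to `c`. After standardization the order flips, because age now contributes on the same footing as income.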