Cluster Analysis Flashcards
Is cluster analysis a supervised or an unsupervised learning method?
Unsupervised learning method
Contrast cluster analysis with PCA
PCA: looks for a low dimensional representation that explains most of the variance
Cluster Analysis: tries to group the observations into a small number of groups of similar observations
Give examples of application of cluster analysis
. Marketing
. Medical
. Actuarial modeling
Give two clustering methods
. K-means
. Hierarchical
Define K-means clustering
Group the observations in a data set into K disjoint clusters in which the observations are relatively homogenous.
We decide the number of cluster in advance, they are exhaustive (every observation belongs to one cluster) and mutually exclusive.
The clusters are selected to minimize the total dissimilarities between points within the cluster
What is the centroid of a cluster and how can you compute it?
Point whose coordinates are the means of the coordinates of the cluster
Formula: ( (sum(xi)/n ; sum(yi)/n )
What is the algorithm for K-means clustering?
- Split the observations arbitrarily into K clusters
- Calculate the centroids for each cluster
- Create new clusters by associating each points with the nearest centroid
- Repeat steps 2 and 3 until the clusters don’t change
What is the dissimilarity function for k-means clustering?
Twice the Euclidean distance of points from the centroid.
Assigning points to the closest centroid can only reduce the total dissimilarity.
Define hierarchical clustering
Consists of a series of fusions of observations results in bigger clusters containing smaller clusters containing smaller clusters.
Does not require to specify the number of clusters in advance
Describe bottom-up / agglomerative clustering
The algorithm starts with n clusters (1 for each observation) and iteratively fuses the two most similar clusters together until all points are in one cluster.
How do you decide how many clusters should be used in bottom-up clustering?
A dendrogram is produced. The number of clusters is determined by deciding at what height to cut the graph.
What is a dissimilarity measure?
A formula that measures how different two points are.
What is the most common dissimilarity measure?
Euclidean distance (square root of the sum of square differences between coordinates)
In what type of cluster analysis do we square the Euclidean distance?
K-means only
Why can’t we square the Euclidean distance
in hierarchical clustering, a dendrogram in which the height of each fusion is the dissimilarity measure. We don’t want to spoil the scale of the graph by squaring the distance.