Chapter 10 - Clustering Flashcards

1
Q

Clustering - overview and types we discussed in class

A

Unsupervised learning, used to uncover latent structure in data. We assign a cluster label to each sample in the data matrix, but this label is not an output variable: it is never observed and is not being predicted. Questions to ask first: Should we scale the features before clustering? Does Euclidean distance capture the dissimilarity we care about? Types: K-means, hierarchical.
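
To make the scaling question concrete, here is a minimal sketch in plain Python. The height/income samples are made up for illustration and are not from the class; `standardize` is a hypothetical helper that rescales each feature to mean 0 and standard deviation 1.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical samples: (height in metres, income in dollars).
samples = [(1.60, 30000.0), (1.90, 30500.0), (1.61, 60000.0)]

# Unscaled: income dominates, so a 30 cm height gap barely registers.
raw_01 = euclidean(samples[0], samples[1])   # differs mostly in height
raw_02 = euclidean(samples[0], samples[2])   # differs mostly in income

# Scaled: both features contribute on a comparable footing.
scaled = standardize(samples)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

Whether scaling is the right choice depends on the problem; the sketch only shows how strongly the units can drive the distances.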

2
Q

K-means

A

Different from classification because there is no output variable (e.g., knowing height and weight but not gender). K is the number of clusters and must be fixed in advance. The goal of K-means is to minimize the dissimilarity (equivalently, maximize the similarity) within each cluster: minimize the sum over clusters L = 1, ..., K of W(C_L), where W(C_L) is the within-cluster variation of cluster C_L. Process: 1) randomly assign each sample to one of the K clusters 2a) compute the centroid of each cluster 2b) reassign each sample to the cluster with the nearest centroid [iterate 2a and 2b until the centroids stop moving].
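
The process on this card can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: the function names, the `seed` and `max_iter` parameters, and the handling of empty clusters (re-seeding with a random sample) are all assumptions, and squared Euclidean distance is assumed as the dissimilarity.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly assign each sample to one of the K clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Step 2a: compute the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Assumption: an empty cluster is re-seeded with a random sample.
            centroids.append(centroid(members) if members else rng.choice(points))
        # Step 2b: reassign each sample to the nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:   # assignments (and so centroids) are stable
            break
        labels = new_labels
    return labels, centroids

# Two well-separated groups of made-up 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The numeric label given to each group (0 or 1) depends on the random start; only the grouping itself is meaningful.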

3
Q

Properties of K-means algorithm (4)

A

1) always converges to a local minimum of the objective (the sum of W(C_L) over clusters), not necessarily the global minimum
2) each step (recomputing centroids, then reassigning samples) can only decrease the objective or leave it unchanged
3) different initializations can yield different local minima
4) in practice, run K-means several times from different initializations and keep the run with the smallest objective value
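
A small 1-D sketch of properties 1, 3 and 4. To make the starting point explicit, this variant starts from given centroids rather than a random assignment (an assumption made for illustration); `kmeans_1d` and the data are hypothetical.

```python
def kmeans_1d(xs, init_centroids, max_iter=100):
    """K-means iterations on 1-D data from given starting centroids.

    Returns (labels, centroids, w), where w is the total within-cluster
    sum of squared distances to the centroid (the objective).
    """
    centroids = list(init_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda c: (x - centroids[c]) ** 2) for x in xs]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [x for x, l in zip(xs, labels) if l == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = sum(members) / len(members)
    w = sum((x - centroids[l]) ** 2 for x, l in zip(xs, labels))
    return labels, centroids, w

xs = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]

# Two different initializations of K = 3 centroids.
starts = [(0.0, 1.0, 15.0), (0.0, 10.0, 20.0)]
runs = [kmeans_1d(xs, s) for s in starts]

for labels, centroids, w in runs:
    print(labels, centroids, w)
# First start gets stuck at w = 101.0, second start reaches w = 1.5.

# Property 4: keep the run with the smallest objective.
best = min(runs, key=lambda r: r[2])
```

Both runs have converged (no sample changes cluster), yet only the second is the better solution, which is why several restarts are compared on the objective value.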

4
Q

Hierarchical clustering

A

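Agglomerative (bottom-up) clustering: start with each sample as its own cluster and repeatedly merge the two most similar clusters. We do not have to commit to a choice of K in advance. The output of hierarchical clustering is a dendrogram; the y-axis tells you the dissimilarity at which groups are merged, and cutting the dendrogram at a given height yields a set of clusters. Hierarchical clustering is not always appropriate: it assumes nested structure, and some groupings are not nested or hierarchical (e.g., a split by gender and a split by nationality).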
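
A naive sketch of the bottom-up procedure in plain Python. The function names and the toy data are hypothetical, complete linkage is assumed by default, and the brute-force search over cluster pairs is only suitable for tiny examples. The merge heights it records are the values that would be read off the y-axis of the dendrogram.

```python
import math

def agglomerative(points, linkage=max):
    """Naive agglomerative clustering.

    `linkage=max` gives complete linkage, `linkage=min` gives single linkage.
    Returns a list of merges as (members_a, members_b, height), where members
    are indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]  # every sample starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(math.dist(points[a], points[b])
                            for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

def n_clusters_at(merges, n_points, height):
    """Number of clusters left after cutting at `height`.

    Assumes merge heights never decrease (no inversions).
    """
    return n_points - sum(1 for _, _, d in merges if d <= height)

points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
merges = agglomerative(points)
heights = [d for _, _, d in merges]
print(heights)                                      # [1.0, 1.0, 6.0, 20.0]
print(n_clusters_at(merges, len(points), 3.0))      # 3
```

No K is chosen up front: the same set of merges gives 3 clusters when cut at height 3 and 2 clusters when cut at height 10.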

5
Q

4 types of linkage used in hierarchical clustering (and any potential drawbacks of each type)

A

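1) complete linkage - maximal intercluster dissimilarity (largest pairwise dissimilarity between a sample in cluster A and a sample in cluster B)
2) single linkage - minimal intercluster dissimilarity (suffers from the chaining phenomenon: samples are fused one at a time, giving long, trailing clusters)
3) average linkage - mean intercluster dissimilarity
4) centroid linkage - dissimilarity between the centroids of cluster A and cluster B (can lead to inversions, where a merge occurs at a lower height than an earlier merge)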
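
The four linkages can be written as short functions over two clusters of points. This is a sketch assuming Euclidean distance as the underlying dissimilarity; the function names and the two example clusters are hypothetical.

```python
import math
from itertools import product
from statistics import mean

def pairwise(A, B):
    """All distances between a point in cluster A and a point in cluster B."""
    return [math.dist(a, b) for a, b in product(A, B)]

def complete_linkage(A, B):
    return max(pairwise(A, B))      # maximal intercluster dissimilarity

def single_linkage(A, B):
    return min(pairwise(A, B))      # minimal intercluster dissimilarity

def average_linkage(A, B):
    return mean(pairwise(A, B))     # mean intercluster dissimilarity

def centroid_linkage(A, B):
    cA = [mean(col) for col in zip(*A)]   # centroid of cluster A
    cB = [mean(col) for col in zip(*B)]   # centroid of cluster B
    return math.dist(cA, cB)

A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (3.0, 2.0)]

print(single_linkage(A, B))     # 3.0
print(complete_linkage(A, B))   # sqrt(13), about 3.606
print(average_linkage(A, B))
print(centroid_linkage(A, B))   # 3.0
```

Single, average and complete linkage are always ordered from smallest to largest for the same pair of clusters, since they are the minimum, mean and maximum of the same pairwise distances.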

6
Q

Choice of dissimilarity measure

A

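The default is Euclidean distance. Correlation-based distance groups customers who purchase similar things (a similar mix of items), not just customers with a similar purchase magnitude. Mahalanobis distance rescales differences by the inverse covariance matrix of the features, which removes the effect of correlation between features: a perturbation in the data that is spread across many correlated factors is weighted the same as one that affects only a few factors.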
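
A sketch of how Euclidean and correlation-based distance can disagree. The purchase counts are made up, and correlation distance is taken here as 1 minus the Pearson correlation, which is one common definition and an assumption about what the class used.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 minus the Pearson correlation between two feature vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1 - cov / (sa * sb)

# Hypothetical purchase counts over four product categories.
light_shopper = [1, 2, 0, 1]
heavy_shopper = [10, 20, 0, 10]   # same mix as light_shopper, ten times the volume
other_shopper = [0, 1, 2, 1]      # similar volume to light_shopper, different mix

# Euclidean distance pairs the light shopper with the other low-volume shopper.
print(euclidean(light_shopper, other_shopper))   # about 2.45
print(euclidean(light_shopper, heavy_shopper))   # about 22.05

# Correlation distance pairs the light shopper with the heavy shopper instead.
print(correlation_distance(light_shopper, heavy_shopper))   # approximately 0
print(correlation_distance(light_shopper, other_shopper))   # approximately 1.5
```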
