Chapter 5. Clustering Flashcards
How does clustering work?
P 192
Clustering attempts to group objects together based on similarity. It does this without using any labels, by comparing how similar the data for one observation is to the data for other observations and grouping the observations accordingly.
Before we perform clustering, we will ____.
P 194
reduce the dimensionality of the data (using PCA).
Clustering algorithms generally perform better, in terms of both time and clustering accuracy, on dimensionality-reduced datasets. P 205
What are the three major clustering algorithms?
P 195
- k-means
- hierarchical clustering
- DBSCAN
Do we choose the number of clusters before using K-means?
P 196
Yes. In k-means clustering, we specify the number of desired clusters k, and the algorithm will assign each observation to exactly one of these k clusters.
How does K-means work?
P 196
The algorithm optimizes the groups by minimizing the within-cluster variation (also known as inertia) such that the sum of the within-cluster variations across all k clusters is as small as possible.
Typically, the k-means algorithm does several runs (n_init) and chooses the run that has the best separation, defined as the lowest total sum of within-cluster variations across all k clusters.
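To make the two ideas above concrete (minimizing inertia, then keeping the best of several runs), here is a minimal pure-Python sketch. It assumes random initial centers and a fixed iteration cap. The function names kmeans_once and kmeans and the six sample points are illustrative and not from the book; only n_init mirrors the parameter named on the card.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, max_iter=100):
    """One run from a random start; returns (labels, inertia)."""
    centers = rng.sample(points, k)  # k observations picked as initial centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins exactly one cluster, the nearest.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Inertia: total squared distance from each point to its assigned center.
    inertia = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, inertia

def kmeans(points, k, n_init=10, seed=0):
    """Repeat the run n_init times and keep the run with the lowest inertia."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[1])

# Illustrative data: two tight groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, inertia = kmeans(points, k=2)
print(labels, round(inertia, 3))
```

Which group is numbered 0 and which is numbered 1 depends on the random start, so the cluster numbers carry no meaning on their own.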
In K-means, the inertia decreases as the number of clusters ____.
P 197
Increases
This makes sense: the more clusters we have, the greater the homogeneity among observations within each cluster.
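One way to see this trend is to fit k-means for several values of k and read the inertia each time. The sketch below assumes scikit-learn is installed and uses synthetic data from make_blobs, not the book's dataset. The choice of 300 points, 4 centers, and k from 1 to 8 is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration): 300 points around 4 centers.
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)

ks = range(1, 9)
inertias = []
for k in ks:
    # n_init=10 runs k-means ten times and keeps the lowest-inertia run.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, value in zip(ks, inertias):
    print(f"k={k}: inertia={value:.1f}")
```

Because inertia keeps falling as k grows, the lowest inertia alone is not a useful criterion for picking k.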
“If PCA does a good job of capturing the underlying structure in the data as compactly as possible, the clustering algorithm (K-means) will have an easy time grouping similar instances together, regardless of whether the clustering happens on just a fraction of the principal components or many more.” What does this mean?
P 203
In other words, clustering should perform just as well using 10 or 50 principal components as it does using one hundred or several hundred principal components.
As the example shows, when PCA is used the change in overall accuracy is minimal (about 3% as the number of components varies over range(10, 749)). Without PCA, changing the number of features used for k-means results in a drastic change in overall accuracy.
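A rough way to try this kind of comparison yourself is sketched below, under several assumptions. It uses scikit-learn's small bundled digits dataset (64 features) as a stand-in for the book's data. It scores the clusters with the adjusted Rand index, not the book's overall accuracy measure. The component counts 10, 20, and 40 are arbitrary. It will not reproduce the 3% figure.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Stand-in dataset (assumption): 1,797 digit images with 64 pixel features each.
X, y = load_digits(return_X_y=True)

scores = {}
for n_components in (10, 20, 40):
    # Reduce dimensionality first, then cluster the reduced data.
    X_reduced = PCA(n_components=n_components, random_state=0).fit_transform(X)
    labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_reduced)
    # The true labels y are used only to score the clusters, never to fit them.
    scores[n_components] = adjusted_rand_score(y, labels)

for n_components, score in scores.items():
    print(f"{n_components} components: ARI = {score:.3f}")
```

If PCA has captured the structure well, the scores should stay close to each other across the different component counts.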