Unsupervised Learning: Clustering Flashcards
What are Gaussian mixture models?
An alternative clustering technique for when you suspect the clusters are elongated (elliptical) in space rather than roughly circular; each cluster is modeled as its own Gaussian distribution with its own covariance.
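A minimal sketch of this idea, assuming scikit-learn's GaussianMixture with two artificially elongated blobs (the blob shapes and parameters here are invented for illustration):

```python
# Sketch: fitting a Gaussian mixture as a clustering alternative to
# KMeans; covariance_type='full' lets each cluster stretch in its
# own direction instead of being forced toward a round shape.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two elongated blobs: wide in x (std 3.0), narrow in y (std 0.3)
X = np.vstack([rng.normal([0, 0], [3.0, 0.3], size=(100, 2)),
               rng.normal([0, 4], [3.0, 0.3], size=(100, 2))])

gm = GaussianMixture(n_components=2, covariance_type='full',
                     random_state=0).fit(X)
labels = gm.predict(X)
print(np.bincount(labels))  # roughly 100 points per component
```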
KMeans: main syntax for creating the model, fitting, and predicting? (3 lines)
- What are the arguments for the model?
- What does predict actually output?
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
km.fit(X)
km.predict(X) # outputs the assigned cluster number for each observation
KMeans: Additional main methods and what they output?
km = KMeans(n_clusters=3)
km.fit(X)
km.cluster_centers_ # coordinates of the centroids
km.inertia_ # sum of squared distances from each point to its centroid
Is a cluster centroid one of the data points in the cluster?
No.
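A quick sketch to convince yourself of this: a centroid is the mean of its cluster's points, which generally isn't one of the points. The toy coordinates below are made up for illustration.

```python
# Sketch: show that no centroid exactly matches a data point.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0],
              [10.0, 10.0], [12.0, 10.0], [10.0, 12.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for c in km.cluster_centers_:
    # each centroid is a cluster mean, e.g. (2/3, 2/3), not a row of X
    print(c, any(np.allclose(c, x) for x in X))
```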
In KMeans clustering, what do the parameters n_init=10 and max_iter=300 mean?
max_iter=300 caps the number of iterations per run; a run stops earlier if the centroids converge
n_init=10 means the entire process is repeated 10 times (with different random starting centroids) and the run with the lowest inertia is kept
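These two parameters can be seen in action on toy data; the blob locations below are invented for illustration:

```python
# Sketch: n_init controls restarts, max_iter caps iterations per
# restart; KMeans keeps the run with the lowest inertia.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

km = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0).fit(X)
print(km.n_iter_)   # iterations the best run actually took (<= max_iter)
print(km.inertia_)  # inertia of the best of the 10 runs
```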
In KMeans clustering, what are the two primary measures of model fit?
What is their syntax?
Inertia and silhouette score.
km.inertia_ # attribute of the fitted model
from sklearn.metrics import silhouette_score
silhouette_score(X, km.predict(X)) # separate function: takes the features and the assigned cluster labels
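Putting both measures together on toy data (the blobs here are made up for illustration):

```python
# Sketch: computing both fit measures for one fitted KMeans model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2))
               for c in ([0, 0], [4, 4], [0, 4])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)                         # 0 to infinity, lower is better
print(silhouette_score(X, km.predict(X)))  # -1 to 1, higher is better
```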
What is the inertia measure in KMeans clustering?
How is it calculated?
What is its range? Is lower or higher better?
What is it used for and why?
What is it NOT used for and why?
Inertia measures how compact each cluster is, considered entirely on its own.
Calculated as the sum of squared distances from each point to its assigned centroid.
Ranges from 0 to infinity. Lower is better.
Used by the KMeans algorithm to optimize cluster centroid locations to achieve lowest inertia, provided an already-predetermined N clusters.
NOT used to figure out N clusters in the first place, because: more clusters = lower inertia = eventually the unsupervised learning version of “overfitting” (e.g., if each data point is its own cluster).
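A sketch of the "more clusters = lower inertia" point: sweeping k and collecting inertia values (the toy blobs below are invented for illustration, and the list typically just keeps shrinking — the "elbow" is eyeballed, not computed).

```python
# Sketch: inertia only drops as k grows, so it can't pick k by itself.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2))
               for c in ([0, 0], [5, 0], [0, 5])])

inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 8)]
print(inertias)  # tends to keep decreasing as k increases
```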
What is the silhouette measure in KMeans clustering and how is it calculated?
What is its range? Is lower or higher better?
What is it used for and why?
Silhouette score measures the tightness of each cluster while taking other clusters into account: for each point it compares the mean intra-cluster distance (a) with the mean distance to the nearest other cluster (b), as (b − a) / max(a, b).
Range -1 to 1.
Higher is better.
Unlike inertia, can be used to determine the optimal N clusters, because: more clusters = NOT necessarily higher silhouette score = silhouette score likely peaks somewhere reasonable.
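A sketch of scanning k with the silhouette score (the three-blob toy data is invented for illustration; with well-separated blobs the score should peak at k=3):

```python
# Sketch: silhouette score does not keep rising with k, so it can be
# scanned across candidate k values to choose the number of clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2))
               for c in ([0, 0], [5, 0], [0, 5])])

scores = {}
for k in range(2, 8):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # expect 3 for this three-blob toy data
```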
What’s a good way to visualize how close specific data points (e.g., customers) are to each other in a dataset with many features?
Perform PCA to reduce the data to at most 3 dimensions, then run KMeans clustering and plot the points colored by cluster.
After reducing dimensions and clustering, what is one way to make sense of what each cluster means? (both conceptually and syntax-wise)
Back-translate each cluster’s centroid into its original features/dimensions.
Given centroids (the output of km.cluster_centers_), do:
standard_scaler.inverse_transform(pca.inverse_transform(centroids))
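The full scale → PCA → cluster → back-translate pipeline can be sketched end to end (the random 6-feature dataset is a stand-in for real data):

```python
# Sketch: scale -> PCA -> KMeans, then map the centroids back into
# the original feature units via the chained inverse transforms.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))  # stand-in for a many-feature dataset

scaler = StandardScaler()
pca = PCA(n_components=2)
X_2d = pca.fit_transform(scaler.fit_transform(X))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_2d)
centroids = km.cluster_centers_

# back-translate: PCA space -> scaled space -> original units
original_units = scaler.inverse_transform(pca.inverse_transform(centroids))
print(original_units.shape)  # (3, 6): one row per cluster, original features
```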
Name 3 examples of unsupervised ML.
- Dimensionality reduction / PCA
- Clustering
- Outlier detection (because it’s trying to label/categorize observations as “outlier” vs “not outlier”, without the benefit of having the “right answer” from any kind of training dataset)
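The outlier-detection point can be sketched with one possible estimator, IsolationForest (chosen here only as an example; the source doesn't name a specific method, and the planted outliers are invented for illustration):

```python
# Sketch: outlier detection is unsupervised -- no "true" outlier
# labels are provided; the model infers them from the data alone.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(loc=0, scale=1, size=(100, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])  # two planted extremes
X = np.vstack([inliers, outliers])

iso = IsolationForest(random_state=0).fit(X)
labels = iso.predict(X)  # +1 = inlier, -1 = flagged outlier
print((labels == -1).sum())
```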