Unsupervised Learning: Clustering Flashcards

1
Q

What are Gaussian mixture models?

A

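An alternative clustering technique that models the data as a mixture of Gaussian distributions. Useful when you suspect the clusters are “long” (elongated or elliptical) in space rather than roughly circular, and when you want soft (probabilistic) cluster assignments.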
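A minimal sketch of fitting a Gaussian mixture with scikit-learn, assuming two synthetic elongated clusters. The data, the stretch matrix, and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two synthetic clusters stretched along the x-axis (illustrative data only).
stretch = np.array([[5.0, 0.0], [0.0, 0.3]])
cluster_a = rng.normal(size=(200, 2)) @ stretch
cluster_b = rng.normal(size=(200, 2)) @ stretch + np.array([0.0, 4.0])
X = np.vstack([cluster_a, cluster_b])

# covariance_type="full" lets each component take its own elliptical shape.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

labels = gmm.predict(X)       # hard assignment: one component index per observation
probs = gmm.predict_proba(X)  # soft assignment: membership probability per component
```

Unlike KMeans, predict_proba gives each observation a probability of belonging to each component, so borderline points can be identified.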

2
Q

KMeans: main syntax for creating the model, fitting, and predicting? (3 lines)

  • What are the arguments for the model?
  • What does predict actually output?
A

from sklearn.cluster import KMeans

km = KMeans(n_clusters=3) # n_clusters: the number of clusters to form

km.fit(X)
km.predict(X) # outputs the assigned cluster index for each observation

3
Q

KMeans: Additional main attributes of the fitted model and what they contain?

A

km = KMeans(n_clusters=3)
km.fit(X)

km.cluster_centers_ # coordinates of centroids
km.inertia_ # sum of squared distances from each point to its nearest centroid

4
Q

Is a cluster centroid one of the data points in the cluster?

A

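Not necessarily. The centroid is the mean of the points assigned to the cluster, so it usually does not coincide with any actual data point.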
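A small sketch that checks this on a toy dataset. The four points are made up, and n_init and random_state are set only to make the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four illustrative points forming two obvious pairs.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for k in range(2):
    members = X[km.labels_ == k]
    # Each centroid equals the mean of the points assigned to its cluster.
    assert np.allclose(km.cluster_centers_[k], members.mean(axis=0))

# True only if some centroid coincides with an actual observation.
centroid_is_data_point = any(
    np.allclose(c, x) for c in km.cluster_centers_ for x in X
)
```

Here each centroid sits midway between the two points of its pair, so none of the centroids is itself an observation.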

5
Q

In KMeans clustering, what do the parameters n_init=10 and max_iter=300 mean?

A

max_iter=300 is the maximum number of iterations allowed in a single run; the run stops earlier if the centroids converge first.

n_init=10 means the entire process is repeated 10 times with different initial centroids, and the run with the lowest inertia is kept.
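A minimal sketch comparing one initialization against ten, assuming synthetic blob data from make_blobs. The dataset and the names single and multi are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data for illustration.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# One initialization vs. ten; each run is capped at 300 iterations.
single = KMeans(n_clusters=4, n_init=1, max_iter=300, random_state=0).fit(X)
multi = KMeans(n_clusters=4, n_init=10, max_iter=300, random_state=0).fit(X)

# n_iter_ is the number of iterations the selected run actually needed.
print("n_init=1: ", single.n_iter_, "iterations, inertia", single.inertia_)
print("n_init=10:", multi.n_iter_, "iterations, inertia", multi.inertia_)
```

On well-separated data the runs typically converge in far fewer than 300 iterations, so max_iter acts as a safety cap rather than a target.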

6
Q

In KMeans clustering, what are the two primary measures of model fit?

What is their syntax?

A

Inertia and silhouette score.

km.inertia_ # attribute of the fitted model

from sklearn.metrics import silhouette_score
silhouette_score(X, km.predict(X)) # additional measure; takes the features and the assigned cluster labels

7
Q

What is the inertia measure in KMeans clustering?
How is it calculated?
What is its range? Is lower or higher better?
What is it used for and why?
What is it NOT used for and why?

A

Inertia measures how compact each cluster is, considering only the points within that cluster.

Calculated as the sum of squared distances from each point to its assigned centroid.

Ranges from 0 to infinity. Lower is better.

Used by the KMeans algorithm as the objective to minimize when placing cluster centroids, given a predetermined number of clusters.

NOT used to choose the number of clusters in the first place, because: more clusters = lower inertia = eventually the unsupervised learning version of “overfitting” (e.g., if each data point is its own cluster, inertia is 0).
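The calculation and the "more clusters = lower inertia" effect can be sketched as follows, assuming synthetic blob data. The names manual_inertia and inertias are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data for illustration.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Inertia by hand: sum of squared distances from each point to its assigned centroid.
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = ((X - assigned_centroids) ** 2).sum()

# Refit with increasing numbers of clusters and record the inertia each time.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 8)
}
```

manual_inertia should match km.inertia_ up to floating-point error, and the values in inertias generally keep falling as k grows, which is why the lowest inertia alone cannot pick the number of clusters.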

8
Q

What is the silhouette measure in KMeans clustering and how is it calculated?
What is its range? Is lower or higher better?
What is it used for and why?

A

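Silhouette score measures the tightness of each cluster, taking other clusters into account. For each point, it compares a, the mean distance to the other points in its own cluster (intra-cluster), with b, the mean distance to the points in the nearest other cluster (inter-cluster): (b - a) / max(a, b). The overall score is the mean over all points.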

Range -1 to 1.

Higher is better.

Unlike inertia, can be used to determine the optimal N clusters, because: more clusters = NOT necessarily higher silhouette score = silhouette score likely peaks somewhere reasonable.
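A minimal sketch of scanning candidate cluster counts with the silhouette score, assuming synthetic blob data. The range of k values and the names scores and best_k are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data for illustration.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 8):  # silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Candidate number of clusters with the highest silhouette score.
best_k = max(scores, key=scores.get)
```

The peak is a suggestion rather than a guarantee; it is worth inspecting the whole scores dictionary and the clusters themselves before settling on a value.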

9
Q

What’s a good way to visualize how close specific data points (e.g., customers) are to each other in a dataset with many features?

A

Perform PCA to reduce to at most 3 dimensions so the points can be plotted, then run KMeans clustering and color the points by cluster.
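A sketch of that workflow, assuming a synthetic 8-feature dataset as a stand-in for real customer data and standardizing before PCA. The helper plot_clusters is hypothetical and requires matplotlib.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a customer table with 8 features.
X, _ = make_blobs(n_samples=200, n_features=8, centers=3, random_state=0)

# Standardize, project onto 2 principal components, then cluster in the reduced space.
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)


def plot_clusters(points, cluster_labels):
    """Scatter the 2-D points, colored by cluster (hypothetical helper; needs matplotlib)."""
    import matplotlib.pyplot as plt

    plt.scatter(points[:, 0], points[:, 1], c=cluster_labels)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()
```

Calling plot_clusters(X_2d, labels) draws the scatter plot; points that land close together are similar along the directions of greatest variance.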

10
Q

After reducing dimensions and clustering, what is one way to make sense of what each cluster means? (both conceptually and syntax-wise)

A

Back-translate each cluster’s centroid into its original features/dimensions.

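Given centroids (the output of km.cluster_centers_), and assuming the data was standardized with standard_scaler before pca was fit, undo the two steps in reverse order: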

standard_scaler.inverse_transform(pca.inverse_transform(centroids))
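An end-to-end sketch of the back-translation, assuming a synthetic 6-feature dataset and a scale, then PCA, then KMeans workflow. The dataset and the name centroids_original are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with 6 original features (illustrative only).
X, _ = make_blobs(n_samples=200, n_features=6, centers=3, random_state=0)

# Forward pipeline: standardize, reduce to 2 components, cluster.
standard_scaler = StandardScaler()
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(standard_scaler.fit_transform(X))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_reduced)
centroids = km.cluster_centers_  # shape (3, 2): centroids in PCA space

# Reverse pipeline: undo PCA first, then undo the scaling.
centroids_original = standard_scaler.inverse_transform(
    pca.inverse_transform(centroids)
)  # shape (3, 6): centroids expressed in the original feature units
```

Because PCA discards some variance, the back-translated centroids are approximations, but they are usually close enough to describe each cluster in terms of the original features.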

11
Q

Name 3 examples of unsupervised ML.

A
  1. Dimensionality reduction / PCA
  2. Clustering
  3. Outlier detection (because it’s trying to label/categorize observations as “outlier” vs “not outlier”, without the benefit of having the “right answer” from any kind of training dataset)