Unsupervised Machine Learning Flashcards

1
Q

Unlabeled Data

A

any data that’s not organized in an easily identifiable manner is known as unstructured/unlabeled data

2
Q

Goals of Unsupervised Learning

A

Goal is to learn about data’s underlying structure and find out how different features relate to each other.

3
Q

Name 2 Methodologies of unsupervised learning

A
  1. Recommendation Systems
  2. K-means models
4
Q

Briefly describe a Recommendation system

A

Recommendation systems are a subclass of machine learning algorithms that:
- can be either supervised or unsupervised
- offer relevant suggestions to users

5
Q

What is the goal of a recommendation system?

A

to quantify how similar one thing is to another, and use this information to suggest a closely related option.

6
Q

What is content-based filtering?

A

Content-based filtering is a type of recommendation system where comparisons are made based on the attributes of the content itself.

For example, attributes of a song you played are compared to attributes of other songs to determine similarity.
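
As a minimal sketch of this idea (the song names and attribute values below are made up for illustration), similarity between attribute vectors can be measured with cosine similarity:

```python
import numpy as np

# Hypothetical song attribute vectors: [tempo, energy, acousticness]
# (names and numbers are illustrative, not from a real catalog)
songs = {
    "song_a": np.array([0.8, 0.9, 0.1]),
    "song_b": np.array([0.7, 0.8, 0.2]),
    "song_c": np.array([0.2, 0.1, 0.9]),
}

def cosine_similarity(u, v):
    # cosine of the angle between two attribute vectors (1 = same direction)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

played = "song_a"  # the song the user just played
scores = {name: cosine_similarity(songs[played], vec)
          for name, vec in songs.items() if name != played}
best = max(scores, key=scores.get)
print(best)  # song_b: its attributes are closest to song_a's
```

The song with the highest similarity score is the one recommended.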

7
Q

What are some benefits of content-based filtering?

A
  • The benefits include being easy to understand, recommending more of what a user likes,
  • not needing other users’ information to work, and
  • being able to map users and items in the same space to recommend things that are closest to a user’s typical preferences.
8
Q

What are some drawbacks of content-based filtering?

A
  1. Always recommends more of the same
  2. Requires manual input of attributes
  3. Cannot recommend across content types
  4. Limited use cases
9
Q

What is collaborative filtering?

A

Collaborative filtering is a type of recommendation system that uses the likes and dislikes of users to make recommendations.

It does not need to know anything about the content itself. All that matters is if the user liked it.
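
A toy sketch of the idea, assuming a small made-up likes matrix (1 = liked, 0 = not liked); note that no content attributes appear anywhere:

```python
import numpy as np

# Made-up user-item matrix: rows = users, columns = items (1 = liked)
R = np.array([
    [1, 1, 0, 0],  # user 0: the user we recommend for
    [1, 1, 1, 0],  # user 1: similar taste to user 0
    [0, 0, 1, 1],  # user 2: different taste
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0
others = [u for u in range(len(R)) if u != target]
# Find the user whose likes/dislikes look most like the target's
neighbor = others[int(np.argmax([cosine(R[target], R[u]) for u in others]))]

# Recommend items the neighbor liked that the target hasn't liked yet
recs = np.where((R[neighbor] == 1) & (R[target] == 0))[0]
print(neighbor, recs)  # neighbor is user 1; item 2 gets recommended
```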

10
Q

What are some benefits of collaborative filtering?

A

The benefits include the ability to
- recommend across content types,
- finding hidden correlations in the data, and
- not requiring tedious manual mapping.

11
Q

What are some drawbacks of collaborative filtering?

A

Drawbacks include
- needing lots of data to even start getting useful results,
- requiring every user to give the system lots of data, and
- dealing with sparse data that has a lot of missing values.

12
Q

What type of model is K-means and what does it do?

A
  • unsupervised learning model
  • partitioning algorithm
  • organizes unlabeled data into clusters
13
Q

What is a Centroid?

A
  • the central point of a cluster
  • computed as the mathematical mean of all the points in the cluster
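
A one-line illustration with NumPy (the points are made up): the centroid is simply the mean of the cluster’s points.

```python
import numpy as np

# Made-up points belonging to one cluster
cluster_points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
centroid = cluster_points.mean(axis=0)  # the mathematical mean of the points
print(centroid)  # [3. 4.]
```
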
14
Q

List the 4 steps to build a K-means model

A
  1. Initiate k centroids
  2. Assign all points to nearest centroid
  3. Recalculate the centroid of each cluster.
  4. Repeat Step 2 and 3 until the algorithm converges
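
The four steps can be sketched in a few lines of NumPy. This is a toy implementation on made-up blob data, with deterministic initialization for reproducibility rather than the usual random choice:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated made-up blobs of 20 points each
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

k = 2
# Step 1: initiate k centroids (fixed here for reproducibility;
# normally they would be chosen randomly)
centroids = np.array([X[0], X[20]])

for _ in range(100):
    # Step 2: assign every point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: recalculate each centroid as the mean of its cluster
    new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    # Step 4: repeat until the centroids stop moving (convergence)
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)  # one centroid near (0, 0), the other near (5, 5)
```
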
15
Q

What is the difference between Clustering and Partitioning Algorithms

A

Clustering algorithms: outlying points can exist outside of the clusters.

Partitioning algorithms: all points must be assigned to a cluster.

In other words, K-means, as a partitioning algorithm, does not allow unassigned outliers.

16
Q

What is k in Initiate k centroids step?

A

K = the number of centroids in your model, which is how many clusters you’ll have.

17
Q

Who decides the value of k?

A

you do — k is a hyperparameter chosen by the modeler

18
Q

How to choose k value?

A

Sometimes k is known in advance; for instance, if there are 3 species of beetle to cluster, then k = 3. Other times, k is unknown and must be estimated.

19
Q

Name 2 other clustering methodologies

A
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups data points together based on their density.
  • Agglomerative clustering: Creates a hierarchy of clusters by merging data points or clusters iteratively
20
Q

What cluster shape does K-means work best with?

A

round clusters

21
Q

DBSCAN (density-based spatial clustering of applications with noise)

A
  • searches your data space for continuous regions of high density.
  • because it finds clusters based on density, the shape of the cluster isn’t as important as it is for K-means.
22
Q

DBSCAN Hyperparameters

A

eps (epsilon) and min_samples

23
Q

DBSCAN: eps, Epsilon (ε)

A

The radius of your search area from any given point

24
Q

DBSCAN: min_samples

A

the number of samples in an ε-neighborhood for a point to be considered a core point (including itself)
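
Putting eps and min_samples together, here is a sketch using scikit-learn’s DBSCAN on made-up data (two dense blobs plus one far-away point that should come out as noise, label -1):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two dense made-up blobs plus a single far-away outlier
X = np.vstack([
    rng.normal(0, 0.3, (25, 2)),
    rng.normal(4, 0.3, (25, 2)),
    [[10.0, 10.0]],
])

# eps: the search radius around each point;
# min_samples: how many points (including itself) must fall inside
# that radius for a point to count as a core point
db = DBSCAN(eps=1.0, min_samples=5).fit(X)

print(set(db.labels_))  # two clusters plus -1 for the noise point
```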

25
Q

Agglomerative clustering

A

works by first assigning every point to its own cluster, then progressively combining clusters based on intercluster distance.

26
Q

Agglomerative clustering requirement

A

you must specify either a desired number of clusters or a distance threshold (the linkage distance above which clusters will not be merged)

27
Q

Agglomerative clustering: Linkage

A

different ways to measure the distances that determine whether or not to merge the clusters.

28
Q

Common Linkages

A

Single: The minimum pairwise distance between clusters.

Complete: The maximum pairwise distance between clusters.

Average: The average pairwise distance between all points in the two clusters.

Ward: This is not a distance measurement. Instead, it merges the two clusters whose merging will result in the lowest inertia.

29
Q

When does Agglomerative clustering stop?

A
  1. You reach a specified number of clusters.
  2. You reach an intercluster distance threshold (clusters that are separated by more than this distance are too far from each other and will not be merged).
30
Q

Agglomerative clustering: Hyperaparameters

A

n_clusters: the number of clusters you want in your final model.

linkage: the linkage method to use to determine which clusters to merge.

affinity: the metric used to calculate the distance between clusters. Default = Euclidean distance.

distance_threshold: the distance above which clusters will not be merged.
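
A short usage sketch with scikit-learn’s AgglomerativeClustering on made-up blob data, showing both stopping rules (note that when distance_threshold is set, n_clusters must be None; the distance metric is left at its Euclidean default):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)
# Two well-separated made-up blobs of 15 points each
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(5, 0.3, (15, 2))])

# Stopping rule 1: merge until a fixed number of clusters remains
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)

# Stopping rule 2: merge until clusters are farther apart than a
# distance threshold (n_clusters must be None in this mode)
agg_t = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0,
                                linkage="single").fit(X)

print(agg.n_clusters_, agg_t.n_clusters_)  # both stop at 2 clusters
```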

31
Q

Agglomerative clustering PROs

A

scales reasonably well, can detect clusters of various shapes.

32
Q

What is considered good clustering model?

A
  1. Clearly identifiable clusters: within each cluster (intracluster), the points are close to each other.
  2. Each cluster well separated from the others: between the clusters (intercluster), you want lots of empty space.
33
Q

K-means: metrics to evaluate good clusters

A
  1. Inertia
  2. Silhouette Score
34
Q

K-means: Inertia

A

Inertia is a metric used in K-Means clustering to measure the quality of the clustering.

It represents the sum of squared distances between each data point and its assigned cluster centroid

Lower Inertia, Better Clustering
The goal of K-Means is to minimize inertia.

A lower inertia indicates that the data points are more tightly clustered around their respective centroids, suggesting a better clustering solution.

35
Q

K-means: Silhouette Score

A
A more precise evaluation metric than inertia because it also takes into account the separation between clusters.

Silhouette score is defined as the mean of the silhouette coefficients of all the observations in the model.

Provides insight as to what the optimal value for K should be, and uses both intracluster and intercluster measurements in its calculation
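
A brief sketch of computing the silhouette score with scikit-learn on made-up, well-separated blobs (so the score should land near 1):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
# Two tight, well-separated made-up blobs
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(4, 0.3, (20, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Mean silhouette coefficient over all observations; it uses both
# intracluster and intercluster distances
score = silhouette_score(X, labels)
print(round(score, 2))  # close to 1 for tight, well-separated clusters
```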

36
Q

Inertia Score

A
  • lower = better (less distance between each observation and its nearest centroid).
  • 0 = a degenerate result (every point sits exactly on its centroid, e.g., when each point is its own cluster).
37
Q

Inertia Score PROs

A
  • helps us to decide on the optimal k value.
  • We do this by using the elbow method.
38
Q

Elbow Method

A

Plot of inertia vs. k values (1, 2, 3, etc.).

  • A good way of choosing an optimal k value is to find the elbow of the curve.
  • This is the value of k at which the decrease in inertia starts to level off.
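
The elbow method can be sketched as follows, fitting K-means for several k values on made-up data with three blobs (so the drop in inertia should level off around k = 3); plotting is omitted and the inertia values are simply printed:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Three made-up blobs, so the elbow should appear around k = 3
X = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in (0, 4, 8)])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to centroids

# Inertia always shrinks as k grows; the elbow is where the
# decrease starts to level off (here, between k = 3 and k = 4)
print([round(i, 1) for i in inertias])
```
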
39
Q

Explain Silhouette Scores (-1,0,1)

A

1 = optimal (an observation sits nicely within its own cluster and is well separated from other clusters).

0 = an observation is on the boundary between clusters.

-1 = an observation is likely in the wrong cluster.