K-means And Hierarchical Clustering Flashcards
What is the purpose of K-means and Hierarchical clustering?
Both are unsupervised learning techniques used to find groups or clusters of data points. Their main goal is to partition data into clusters such that similar data points are grouped together.
Is it best for supervised or unsupervised learning?
They are best for unsupervised learning, finding clustering tendencies without knowing the true labels (y).
What are they both based on?
They are both based on Euclidean distance.
What are the three different clustering techniques?
Well-separated, Center-based and Contiguity-based
What are the two elaborated clustering techniques?
Density-based and Conceptual
What is well-separated clustering technique?
Each point is closer to all points in its own cluster than to any point in another cluster. Similar to maximum linkage in hierarchical clustering, assuming well-separated clusters.
What is center-based clustering technique?
Each point is closer to the center of its cluster than to the center of any other cluster. K-means and Ward clustering take a center-based approach to finding clusters.
What is contiguity-based clustering technique?
Each point is closer to at least one point in its cluster than to any point in another cluster. Hierarchical clustering with minimum-linkage takes a contiguity-based approach to finding clusters.
What is Density-based clustering technique?
Clusters are regions of high density separated by regions of low density. The Gaussian mixture model takes a density-based approach to finding clusters.
What is conceptual clustering technique?
Points in a cluster share some general property that is derived from the entire set of points.
What is the goal of K-means?
The goal of K-means is to partition the dataset into K clusters, where the distance between data points within a cluster is smaller than the distance between data points in different clusters. It aims to minimize the sum of squared distances (error) between points and their assigned cluster centroids.
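This objective, the sum of squared distances between each point and its assigned centroid, can be sketched as a small function (a minimal illustration assuming NumPy; the function name and toy data are made up for this example):

```python
import numpy as np

def kmeans_sse(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(np.sum((X[labels == k] - centroids[k]) ** 2)
               for k in range(len(centroids)))

# Two tight clusters, around (0, 0) and (10, 10):
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
error = kmeans_sse(X, labels, centroids)  # small, since points sit near their centroids
```

A good clustering yields a small value of this error; K-means searches for assignments and centroids that minimize it.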
How are the μ_k updated?
In K-means, the centroids μ_k (the mean of each cluster) are updated by calculating the mean of all the points assigned to each cluster. The centroid is the point that minimizes the distance to all points within the cluster.
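The update step is just a per-cluster mean. A minimal sketch, assuming NumPy (names illustrative):

```python
import numpy as np

def update_centroids(X, labels, K):
    """mu_k = mean of all points currently assigned to cluster k."""
    return np.array([X[labels == k].mean(axis=0) for k in range(K)])

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
mu = update_centroids(X, labels, K=2)  # mu[0] = [1, 0], mu[1] = [10, 10]
```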
What is μ_k?
The center or mean of a cluster (or another reference point) in the same space
What is z_(ik)?
A binary indicator for each x_i: z_(ik) = 1 if x_i belongs to cluster k, otherwise 0.
What is euclidean distance?
Euclidean distance is a measure of the straight-line distance between two points in Euclidean space.
|| x_i - μ_k ||
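In code, this norm is a one-liner (assuming NumPy):

```python
import numpy as np

x_i = np.array([3.0, 4.0])
mu_k = np.array([0.0, 0.0])
dist = np.linalg.norm(x_i - mu_k)  # sqrt(3^2 + 4^2) = 5.0
```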
How is the distance to μ_k determined?
If x_i is closest to cluster k, then z_(ik) = 1; otherwise it is 0, meaning the point is not part of that cluster and is not included when computing the updated μ_k (centroid).
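Alternating the assignment step (setting z_(ik)) and the update step (recomputing μ_k) gives the full K-means loop. A minimal sketch assuming NumPy, not a production implementation (it does not handle empty clusters):

```python
import numpy as np

def kmeans(X, centroids, n_iter=10):
    """Basic K-means: alternate assignment and centroid-update steps."""
    for _ in range(n_iter):
        # Assignment: each point gets the label of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid becomes the mean of its assigned points.
        centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, centroids

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans(X, centroids=X[[0, 2]].copy())
```

On this toy data the loop converges to labels [0, 0, 1, 1] with centroids at the cluster means.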
How do we know we have chosen the optimal number of clusters K?
By the sum of squared errors (SSE): using the Euclidean distance between each x_i and its centroid, compute the squared error for each cluster and sum over all clusters to get the overall SSE. We want a low error without overfitting by using too many clusters.
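In practice this means running K-means for several values of K and comparing the resulting SSE values. A self-contained sketch assuming NumPy (naive first-K-points initialization, illustrative only):

```python
import numpy as np

def sse_for_k(X, k, n_iter=10):
    """Run a basic K-means and return the final sum of squared errors."""
    centroids = X[:k].copy()  # naive init: first k points (illustrative)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return ((X - centroids[labels]) ** 2).sum()

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
errors = [sse_for_k(X, k) for k in (1, 2)]  # SSE drops sharply at the true K
```

The SSE always decreases as K grows, so one looks for the point where adding another cluster stops paying off.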
Issues with K-means?
- Sensitivity to initialization: Different initial centroids can lead to different results.
- Requires predefined K: The number of clusters K needs to be specified in advance, which can be difficult to determine.
- Sensitive to outliers: Outliers can distort the centroids significantly.
- Assumes spherical clusters: Works best with well-separated and roughly spherical clusters.
How to ensure centroids are well spread out?
To ensure centroids are well spread out, use better initialization methods like the farthest-first initialization. These methods select initial centroids that are far apart from each other, reducing the likelihood of poor cluster assignments.
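Farthest-first initialization can be sketched as follows (a minimal version assuming NumPy; function name is illustrative):

```python
import numpy as np

def farthest_first_init(X, K, seed=0):
    """Pick one random centroid, then repeatedly add the point
    farthest from all centroids chosen so far."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Distance of every point to its nearest chosen centroid:
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    return np.array(centroids)

X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
init = farthest_first_init(X, K=2)  # one centroid from each far-apart corner
```

A popular softer variant, k-means++, picks each next centroid randomly with probability proportional to its squared distance, which is more robust to outliers than always taking the single farthest point.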
What is hierarchical clustering?
Hierarchical clustering is an unsupervised method that builds a hierarchy of clusters by iteratively merging or splitting clusters. It doesn’t require specifying the number of clusters in advance and produces a dendrogram, a tree-like diagram that shows the relationships between clusters.
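In practice the merge hierarchy and the dendrogram cut are usually computed with a library. A short sketch assuming SciPy is available (`linkage` records the merge history; `fcluster` cuts the tree into a chosen number of clusters):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
Z = linkage(X, method='single')   # merge history; the dendrogram's data
labels = fcluster(Z, t=2, criterion='maxclust')  # cut into 2 clusters
```

`scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree itself; `method` can be 'single', 'complete', 'average', or 'ward', matching the linkage criteria on the following cards.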
What is minimum linkage?
Minimum (or single) linkage is a method in hierarchical clustering where the distance between two clusters is defined as the distance between the closest pair of observations in the two clusters.
What is maximum linkage?
Maximum (or complete) linkage in hierarchical clustering defines the distance between two clusters as the distance between the most distant pair of observations in the two clusters.
What is average linkage?
Average linkage calculates the distance between two clusters as the average distance between all pairs of points in the two clusters.
What is the Ward method?
The Ward method is a hierarchical clustering method that minimizes the variance within clusters. It merges the two clusters whose combination results in the smallest increase in the total sum of squared distances.
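The increase in total SSE from merging two clusters A and B has a closed form, (|A||B| / (|A|+|B|)) · ||mean_A − mean_B||², which Ward's method minimizes at every merge. A small sketch assuming NumPy (function name illustrative):

```python
import numpy as np

def ward_merge_cost(A, B):
    """Increase in total within-cluster SSE from merging clusters A and B."""
    nA, nB = len(A), len(B)
    diff = A.mean(axis=0) - B.mean(axis=0)
    return (nA * nB) / (nA + nB) * np.dot(diff, diff)

close = ward_merge_cost(np.array([[0.0, 0.0]]), np.array([[0.0, 1.0]]))
far = ward_merge_cost(np.array([[0.0, 0.0]]), np.array([[10.0, 10.0]]))
# Ward merges the pair with the smaller cost first (the close pair here).
```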