L13 - Hierarchical Structuring and DBSCAN Flashcards
What is hierarchical clustering?
- An unsupervised learning method for clustering similar data points.
- The goal is to represent degrees of similarity within data.
- Enables partitioning the data at multiple levels of granularity.
What is the structure of hierarchical clustering?
- Clusters and sub-clusters.
- Results in tree-like structure.
What are Dendrograms?
Tree diagrams that show a hierarchy of clusters, where each node represents a cluster.
In Dendrograms, what are leaves called?
Singletons
What are the 2 approaches to creating a Dendrogram?
- Bottom up -> Agglomerative
- Top down -> Divisive
How is a Dendrogram created using the Agglomerative method?
- Start with each data point as its own cluster
- Iteratively:
    - Compute the distance matrix between clusters
    - Merge the two closest clusters
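The agglomerative steps above can be sketched with SciPy's `linkage`, which performs exactly this compute-distances-then-merge-closest loop (a minimal sketch; the sample points are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight groups of points, far from each other (illustrative data).
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# linkage() repeatedly merges the closest pair of clusters; each row of Z
# records one merge: (cluster_a, cluster_b, distance, new cluster size).
Z = linkage(points, method="single")

# Cut the resulting dendrogram into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Calling `scipy.cluster.hierarchy.dendrogram(Z)` on the same `Z` would draw the tree itself.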
How is a dendrogram created using Divisive method?
- Start with all data points in a single large cluster
- Iteratively split clusters into smaller ones
What is the time complexity of Hierarchical clustering?
O(N^3)
When performing hierarchical clustering, what is the only thing we need to be able to compute?
Distance matrix between points
Why can't we use brute force to calculate the distance matrix?
With N points there are O(N^2) pairwise distances, so for large datasets the computation becomes too expensive.
What are the 5 methods we can use to calculate distance between clusters?
- Single Linkage -> Distance defined as the distance between the 2 closest data points of 2 separate clusters.
- Complete Linkage -> Distance defined as the distance between the 2 furthest data points of 2 separate clusters.
- Average Linkage -> Distance defined as the average distance between all pairs of members across the two clusters.
- Centroid Linkage -> Distance defined by the distance between the centroids of each cluster.
- Ward's Method -> Merge the pair of clusters that gives the smallest increase in total within-cluster variance (the summed distance of points from their centroids).
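All five definitions are available as `method` options in SciPy's `linkage`; this sketch (with made-up collinear points so the distances are easy to check) shows how the choice changes the distance at which the final two clusters merge:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two pairs of points on a line: {0, 1} and {10, 11} (illustrative data).
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])

# Each method defines "distance between clusters" differently, so the
# height of the final merge in the dendrogram differs per method.
finals = {}
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(points, method=method)
    finals[method] = float(Z[-1, 2])  # column 2 = merge distance
    print(method, round(finals[method], 3))
```

Here single linkage reports 9 (closest points 1 and 10), complete linkage 11 (furthest points 0 and 11), and average linkage 10.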
What is an issue with each of the 5 distance calculation methods?
- Single Linkage -> Can lead to a long chain of clusters (chaining effect)
- Complete Linkage -> Often breaks large clusters into 2 or more
- Average Linkage -> High computational cost, since every pairwise distance between the two clusters must be computed
- Centroid Linkage -> Biased towards spherical clusters
- Ward's Method -> Biased towards spherical clusters
What method do we use to know if we have a good cluster count?
Elbow method
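A minimal sketch of the elbow method using scikit-learn's KMeans on synthetic blob data (the data and the range of k values are illustrative): within-cluster inertia drops sharply up to the true cluster count, then flattens, forming the "elbow".

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated blobs (illustrative).
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Inertia = within-cluster sum of squared distances to the centroid.
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(range(1, 7), inertias):
    print(k, round(inertia, 1))
```

Plotting k against inertia makes the bend at the true cluster count visible; here the drop from k=1 to k=3 is much larger than any drop afterwards.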
What is DBSCAN?
A density-based clustering algorithm. Instead of comparing every point to every other point (as hierarchical clustering does), it places a density threshold (a radius) around each point; if that neighborhood contains enough other data points, the point is labelled a Dense (core) Point.
What are the 2 hyper parameters of DBSCAN?
Epsilon -> Radius of the density threshold (neighborhood) around each point
MinPts -> Minimum number of points needed within that radius for a point to be considered a dense point
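The two hyperparameters map directly onto scikit-learn's `DBSCAN` as `eps` and `min_samples` (a minimal sketch; the point coordinates and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (illustrative data).
points = np.array([
    [0.0, 0.0], [0.1, 0.1], [0.2, 0.0],   # dense group 1
    [5.0, 5.0], [5.1, 5.1], [5.0, 5.2],   # dense group 2
    [9.0, 0.0],                           # isolated point
])

# eps = Epsilon (radius of the density threshold),
# min_samples = MinPts (points needed in the radius for a dense point).
db = DBSCAN(eps=0.5, min_samples=3).fit(points)
print(db.labels_)  # label -1 marks noise points
```

The isolated point has no dense neighborhood, so DBSCAN labels it noise (-1) rather than forcing it into a cluster.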