L13 - Hierarchical Structuring and DBSCAN Flashcards
What is hierarchical clustering?
- An unsupervised learning method for grouping similar data points into clusters.
- The goal is to represent degrees of similarity within the data.
- Enables the data to be partitioned at various levels of granularity.
What is the structure of hierarchical clustering?
- Clusters and sub-clusters.
- Results in tree-like structure.
What are Dendrograms?
Tree diagrams that show a hierarchy of clusters, where each node represents a cluster.
In Dendrograms, what are leaves called?
Singletons
What are the 2 approaches to creating a Dendrogram?
- Bottom up -> Agglomerative
- Top down -> Divisive
How is a Dendrogram created using the Agglomerative method?
- Start with each individual data point as its own cluster
- Iteratively:
- Compute the distance matrix between clusters
- Merge the two closest clusters (see the sketch below)
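A minimal sketch of the agglomerative procedure using SciPy's hierarchical clustering; the toy two-blob data and the cut height are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.5, (10, 2)),   # one blob around (0, 0)
                    rng.normal(5, 0.5, (10, 2))])  # another around (5, 5)

# Each iteration merges the two closest clusters until one remains.
Z = linkage(points, method="single")

# Cut the dendrogram at a chosen height to recover flat cluster labels.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)
```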
How is a Dendrogram created using the Divisive method?
- Start with all data points in one large cluster
- Iteratively split the clusters into smaller ones (see the sketch below)
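SciPy does not ship a divisive routine, so this is a hedged sketch of one common approximation of the top-down idea: repeatedly bisect the largest remaining cluster with 2-means. The `divisive` helper, the data, and the stopping rule are all illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(points, max_clusters=4):
    clusters = [points]
    while len(clusters) < max_clusters:
        # Split the largest remaining cluster into two.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        target = clusters.pop(idx)
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(target)
        clusters.append(target[km.labels_ == 0])
        clusters.append(target[km.labels_ == 1])
    return clusters

rng = np.random.default_rng(0)
data = rng.normal(0, 1, (40, 2))
print([len(c) for c in divisive(data)])
```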
What is the time complexity of Hierarchical clustering?
O(N^3)
When performing hierarchical clustering, what is the only thing we need to be able to compute?
A distance matrix between the points (see the sketch below)
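A minimal sketch of building that distance matrix with SciPy; the toy points are an illustrative assumption:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
condensed = pdist(points, metric="euclidean")  # N*(N-1)/2 pairwise distances
matrix = squareform(condensed)                 # expanded to a full N x N matrix
print(matrix)
```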
Why can’t we use brute force to calculate the Distance Matrix?
With many points, the number of pairwise distances (and the cost of recomputing them at every merge step) makes the complexity too high.
What are the 5 methods we can use to calculate distance between clusters? (See the sketch after this list.)
- Single Linkage -> Distance defined as the distance between the 2 closest data points of 2 separate clusters.
- Complete Linkage -> Distance defined as the distance between the 2 furthest data points of 2 separate clusters.
- Average Linkage -> Distance defined as the average distance between all pairs of members of the 2 clusters.
- Centroid Linkage -> Distance defined as the distance between the centroids of the 2 clusters.
- Ward's Method -> Join the pair of clusters whose merge minimises the increase in total distance from the centroids.
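A minimal sketch of how these five criteria map onto the `method` argument of SciPy's `linkage`; the random data is an illustrative assumption (centroid and Ward linkage assume Euclidean distances on raw observations):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.random.default_rng(0).normal(size=(20, 2))

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(points, method=method)
    # Z[-1, 2] is the distance at which the last two clusters merge.
    print(f"{method:>8}: final merge distance = {Z[-1, 2]:.3f}")
```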
What is an issue with each of the 5 distance calculation methods?
- Single Linkage -> Can lead to a long chain of clusters
- Complete Linkage -> Often breaks large clusters into 2 or more
- Average Linkage -> High computational cost, since every pair of data points across the 2 clusters must be compared
- Centroid Linkage -> Biased towards spherical clusters
- Ward's Method -> Biased towards spherical clusters
What method do we use to know if we have a good cluster count?
The Elbow method (see the sketch below)
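A minimal sketch of the elbow method using k-means inertia (the total within-cluster sum of squares); the three-blob data is an illustrative assumption. Plotting inertia against k, the curve bends ("elbows") at the good cluster count:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 5, 10)])

for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
    print(f"k={k}: inertia={inertia:.1f}")  # the drop flattens after k=3
```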
What is DBSCAN?
A density-based clustering algorithm. Rather than comparing every point to every other (as in hierarchical clustering), it places a density threshold around each point; if the threshold area contains enough other data points, the point is classed as a Dense Point.
What are the 2 hyperparameters of DBSCAN?
Epsilon -> Radius of the density threshold
MinPts -> Minimum number of points needed within the threshold for a point to be considered a dense point (see the sketch below)
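A minimal sketch of running DBSCAN with scikit-learn, where MinPts is called `min_samples`; the data and the parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (30, 2)),
                  rng.normal(4, 0.3, (30, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(data)
print(db.labels_)  # cluster ids per point; -1 marks noise
```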
How are points categorised in DBSCAN?
Core point -> Has at least MinPts points within its Epsilon radius. These points are in the interior of a cluster.
Border point -> Has fewer than MinPts points within its Epsilon radius, but lies in the neighbourhood of a core point.
Noise point -> Any point that is not a core or border point (see the sketch below).
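A hedged sketch of recovering the three categories from a fitted scikit-learn DBSCAN model; the data is an illustrative assumption. Core points are exposed via `core_sample_indices_`, noise points receive the label -1, and the rest are border points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (30, 2)),   # a dense blob
                  rng.uniform(-2, 6, (5, 2))])   # scattered outliers

db = DBSCAN(eps=0.5, min_samples=5).fit(data)

core_mask = np.zeros(len(data), dtype=bool)
core_mask[db.core_sample_indices_] = True   # >= MinPts neighbours within eps
noise_mask = db.labels_ == -1               # reachable from no core point
border_mask = ~core_mask & ~noise_mask      # near a core point, not core itself

print(f"core={core_mask.sum()}, border={border_mask.sum()}, noise={noise_mask.sum()}")
```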
What is the result of DBSCAN?
All points within a cluster can reach one another through steps of size at most Epsilon.
What are the pros and cons of DBSCAN?
Pros -> Resistant to noise. Can handle clusters of different shapes and sizes.
Cons -> Eps and MinPts interact and can be hard to specify.
What are 2 limitations of DBSCAN?
- Struggles to cluster data with varying densities
- Highly dependent on the selection of Eps and MinPts