Hierarchical clustering Flashcards
Hierarchical clustering algorithm operates in _______ fashion and why
Hierarchical clustering algorithms typically operate in a greedy fashion, making locally optimal choices at each step (merging the closest clusters or splitting the largest clusters) without reconsidering previous steps.
Hierarchical clustering is __________
Divide-and-conquer clustering
Another name for agglomerative clustering
Bottom-up approach
Another name for divisive clustering
Top-down approach
Hierarchical clustering can be used for what, and can't be used for what?
Hierarchical clustering can be used for outlier detection, but not for finding missing values (NA) or detecting fake values.
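As a hedged sketch of the outlier-detection use (SciPy and the toy data are my assumptions, not from the card): points that remain in tiny clusters after cutting the dendrogram can be flagged as outliers.

```python
# Minimal sketch: outliers end up in singleton clusters when the
# dendrogram is cut, because they merge with everything else last.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),     # one dense cluster
               np.array([[8.0, 8.0]])])            # one far-away point

Z = linkage(X, method='single')                    # single linkage merges the outlier last
labels = fcluster(Z, t=3.0, criterion='distance')  # cut the dendrogram at height 3.0

# Points in clusters of size 1 are candidate outliers.
sizes = np.bincount(labels)
outliers = np.where(sizes[labels] == 1)[0]
print(outliers)                                    # expected: index 50 (the injected point)
```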
Hierarchical clustering is primarily used for ______ because ________
Hierarchical clustering is primarily used for exploration because it helps in understanding the natural groupings within the data, which is very useful in exploratory data analysis.
Hierarchical clustering uses _________ visualization
Dendrogram visualization
In hierarchical clustering do we need to specify the number of clusters?
No need to specify the number of clusters in hierarchical clustering
How does hierarchical clustering provide flexibility?
It allows you to choose the number of clusters by cutting the dendrogram at different levels, providing flexibility to explore the data at different granularities.
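A minimal sketch of that "cut at different levels" idea, assuming SciPy and some made-up 2-D data: the hierarchy is built once, then reused at different granularities.

```python
# One hierarchy, two granularities: fcluster cuts the dendrogram
# without recomputing the clustering.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(42).random((20, 2))
Z = linkage(X, method='complete')                    # build the full hierarchy once

labels_k2 = fcluster(Z, t=2, criterion='maxclust')   # cut for 2 clusters
labels_k5 = fcluster(Z, t=5, criterion='maxclust')   # cut for 5 clusters

dendrogram(Z)                                        # visualise where the cuts fall
plt.show()
```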
Is hierarchical clustering deterministic or not?
Hierarchical clustering is deterministic because it follows a fixed sequence of merges or splits based on defined criteria such as distance.
Linkage (definition and types)
Linkage defines how the distance between two clusters is measured, i.e., how clusters are linked.
Linkage techniques are of two main types: single linkage and complete linkage (average linkage and Ward's method are also used).
Single linkage
* Another name
* Keyword
* Definition
* Formula
- Another name: Nearest neighbour method
- Keyword: shortest distance
- Definition: This linkage technique focuses on the shortest distance between data points in the two clusters.
- Formula: d(C1, C2) = min { d(x, y) : x ∈ C1, y ∈ C2 }
Complete linkage
* Another name
* Keyword
* Definition
* Formula
- Another name: Farthest neighbour method
- Keyword: longest distance
- Definition: This linkage technique focuses on the longest distance between data points in the two clusters.
- Formula: d(C1, C2) = max { d(x, y) : x ∈ C1, y ∈ C2 }
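Both formulas reduce to taking the min or max over the pairwise distance matrix between the two clusters; a direct translation (the data values are made up for illustration):

```python
# cdist computes all pairwise distances between cluster A and cluster B;
# single linkage takes the min, complete linkage takes the max.
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

D = cdist(A, B)                 # D[i, j] = d(A[i], B[j])
single_link = D.min()           # shortest pairwise distance -> 3.0
complete_link = D.max()         # longest pairwise distance  -> 6.0
print(single_link, complete_link)
```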
Agglomerative clustering keyword
Merging approach
Agglomerative clustering uses which linkage?
It can use any linkage:
Single linkage, complete linkage, or average linkage
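For illustration, scikit-learn's AgglomerativeClustering takes the linkage as a parameter; a hedged sketch on random data (the data and cluster count are assumptions):

```python
# Same algorithm, different linkage criteria, chosen per run.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.default_rng(0).random((30, 2))

for link in ('single', 'complete', 'average', 'ward'):
    model = AgglomerativeClustering(n_clusters=3, linkage=link)
    labels = model.fit_predict(X)
    print(link, labels[:10])
```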
Divisive clustering uses which linkage?
Divisive clustering uses only complete linkage
Point to remember in agglomerative clustering problems
The average linkage technique (cluster distance = the average of all pairwise distances between the two clusters) can also be used
Divisive clustering keyword
Splitting approach
How to solve a divisive clustering problem
We create a minimum spanning tree (MST) based on the dissimilarity matrix
Minimum spanning tree characteristics (4)
It is a connected tree.
No loops/closed circuits in the tree.
Every data point (node) is included in the tree.
If ‘n’ nodes are present in the tree, then (n-1) edges are formed in the tree.
If there are ‘n’ nodes in the MST, then
(n-1) edges are present or formed in the tree.
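A hedged sketch of the MST step, assuming SciPy and toy data: build the MST from the dissimilarity matrix, verify the (n-1)-edges property, then split by removing the longest edge.

```python
# Divisive step via MST: remove the longest MST edge to split one
# cluster into two connected components.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

X = np.random.default_rng(5).random((10, 2))
D = squareform(pdist(X))                      # dissimilarity matrix
mst = minimum_spanning_tree(D).toarray()      # MST edges as a dense array

assert np.count_nonzero(mst) == len(X) - 1    # the (n-1)-edges property above

mst[mst == mst.max()] = 0                     # drop the single longest edge
n_comp, labels = connected_components(mst, directed=False)
print(n_comp, labels)                         # 2 clusters
```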
Point to remember in divisive clustering problems
The minimum spanning tree (MST) is built from the dissimilarity matrix
Explain the number of levels in the hierarchy in both agglomerative clustering and divisive clustering
Agglomerative clustering: If there are n observations, there will be n-1 levels in the hierarchy, since n-1 merges are required to combine n observations into a single cluster.
Divisive clustering: The number of levels depends on how the splits occur (e.g., binary splits may create more or fewer than n-1 levels).
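The n-1 figure can be checked directly: SciPy's linkage matrix has exactly one row per merge. A quick sketch (the data is made up):

```python
# n observations -> n-1 merges -> n-1 rows in the linkage matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage

n = 12
X = np.random.default_rng(1).random((n, 2))
Z = linkage(X, method='average')
assert Z.shape[0] == n - 1      # n-1 merges from n singletons to 1 cluster
print(Z.shape)                  # (11, 4)
```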
Ward’s method
A cluster-merging technique.
In this technique, we minimise the increase in variance when merging clusters. This is repeated iteratively until all data points are in a single cluster or until the desired number of clusters is reached.
Similarity of two clusters is based on the increase in squared error when two clusters are merged.
(Similar to group average if distance between points is distance squared).
Less susceptible to noise and outliers.
Biased towards globular clusters.
Hierarchical analogue of k-means
(can be used to initialise k-means)
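A minimal sketch of Ward's method, assuming SciPy: method='ward' merges the pair of clusters whose merge gives the smallest increase in squared error, and the resulting labels could seed k-means as noted above.

```python
# Ward's method: each merge minimises the increase in within-cluster
# variance (squared error).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(7).normal(size=(40, 2))
Z = linkage(X, method='ward')
labels = fcluster(Z, t=4, criterion='maxclust')   # stop at 4 clusters

# These 4 clusters (or their means) could initialise k-means with k=4.
print(np.bincount(labels))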
In divisive clustering, explain cluster splits in terms of variance
At each step, the cluster with the highest within-cluster variance (sum of squared errors) is chosen and split, so that each split yields the largest reduction in total variance.
Which is more sensitive to outliers (single linkage / complete linkage)?
Single linkage is more sensitive to outliers
Single linkage characteristics
Can handle non-elliptical (arbitrary) cluster shapes.
Sensitive to noise and outliers (produces a chaining effect).
Another name for complete linkage
Maximum linkage or the farthest-neighbour method
Complete linkage characteristics
Less susceptible to noise and outliers.
Tends to break large clusters.
Biased towards globular clusters.
Which one is more expensive (Divisive/ agglomerative)
Divisive clustering is computationally more expensive than agglomerative clustering because it requires considering all possible splits at each step.
Which one is more commonly used (divisive / agglomerative)?
Agglomerative is more commonly used whereas divisive is less commonly used due to its complexity.
If we solve a divisive clustering question using k-means, what should the value of k be?
k = 2 (each chosen cluster is split into two at every step)
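A hedged sketch of that idea (bisecting with repeated 2-means); `bisecting_clusters` is an illustrative helper name, not a library function, and the split-selection rule follows the variance card above:

```python
# Divisive clustering via repeated 2-means: always split the cluster
# with the largest SSE until the desired number of clusters is reached.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_clusters(X, n_clusters):
    clusters = [np.arange(len(X))]          # start: one cluster with all points
    while len(clusters) < n_clusters:
        # pick the cluster with the largest SSE (variance) to split
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters

X = np.random.default_rng(3).random((60, 2))
print([len(c) for c in bisecting_clusters(X, 4)])
```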
Which is more efficient (k-means / hierarchical clustering)?
k-means: its cost grows roughly linearly with the number of points, whereas hierarchical clustering needs at least the full pairwise-distance matrix, i.e., O(n^2) time and memory.