Clustering Flashcards
What are the disadvantages of KMeans?
- The number of clusters (k) must be specified in advance.
- Even with the elbow method, the elbow is sometimes not clear from the plot.
- Sensitive to outliers, because it is a distance-based algorithm and centroids are means.
- Performs poorly when the clusters are not roughly spherical.
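The elbow method mentioned above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans_inertia`, the toy data, and the choice to initialise centroids from the first k points are all assumptions made for the example.

```python
import math

def kmeans_inertia(points, k, iterations=20):
    """Run a basic Lloyd-style k-means and return the within-cluster sum of squares."""
    # Assumption for determinism: use the first k points as the initial centroids.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    # Inertia: squared distance from every point to its nearest centroid.
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

# Toy data: three tight groups; the first three points come from different groups.
points = [(0, 0), (10, 10), (20, 0),
          (0, 1), (1, 0), (10, 11), (11, 10), (20, 1), (21, 0)]

# Inertia for k = 1..4; the "elbow" is where the decrease flattens out.
inertias = [kmeans_inertia(points, k) for k in range(1, 5)]
for k, value in enumerate(inertias, start=1):
    print(k, round(value, 2))
```

On this toy data the inertia drops sharply up to k = 3 and only slightly afterwards, which is the pattern the elbow method looks for. On real data the bend is often much less pronounced.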
Full form of DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Epsilon
Radius of the neighborhood around a point (a circle in 2D).
MinPts
Minimum number of points that must lie within Epsilon of a point for it to be called a core point.
Core point
A point that has at least MinPts points within Epsilon of it.
Border point
A point that has fewer than MinPts points within Epsilon but lies within Epsilon of at least one core point.
Noise point
A point that is neither a core point nor a border point.
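The core, border, and noise definitions above can be sketched as a small labelling function. This is only an illustration of the definitions, not a full DBSCAN: the name `classify_points` and the sample data are made up for the example, and it assumes a point counts itself toward MinPts (a common convention, but conventions differ).

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' from the DBSCAN definitions."""
    # Neighbourhood of each point: indices of all points within eps.
    # Assumption: the point itself is included in its own neighbourhood.
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Core points have at least min_pts points in their neighbourhood.
    core = {i for i, nbrs in enumerate(neighbours) if len(nbrs) >= min_pts}
    labels = []
    for i, nbrs in enumerate(neighbours):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs):
            # Not dense enough itself, but within eps of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

Here (2, 1) has only itself and (1, 1) within Epsilon, so it is not a core point, but (1, 1) is a core point, which makes (2, 1) a border point. The isolated point (8, 8) is noise.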
Advantages and disadvantages of DBSCAN
Advantages:
1. Robust to outliers
2. No need to specify the number of clusters
3. Can find arbitrary shaped clusters
Disadvantages:
1. Difficulty with clusters of varying density
2. Cannot directly assign a new point to a cluster (the algorithm has no prediction step)
Types of Hierarchical Clustering
- Agglomerative Clustering (Bottom-up)
- Divisive Clustering (Top-down)
Proximity matrix
A square, symmetric matrix that stores the distance between each pair of data points.
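A proximity matrix for a handful of points can be built as follows. This is a minimal sketch that assumes Euclidean distance; the function name `proximity_matrix` and the sample points are illustrative.

```python
import math

def proximity_matrix(points):
    """Return an n x n list of lists where entry [i][j] is the distance from point i to point j."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

points = [(0, 0), (3, 4), (6, 8)]
matrix = proximity_matrix(points)
for row in matrix:
    print(row)
```

The diagonal is zero (each point is at distance 0 from itself) and the matrix is symmetric, so only the upper or lower triangle actually needs to be stored.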
Types of agglomerative clustering
- Min (Single link)
- Max (Complete link)
- Average
- Ward
Single Link
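The distance between two clusters is the minimum distance between any point in one cluster and any point in the other.
It is found by computing the distance for every cross-cluster pair of points and taking the smallest.
Should not be used in the presence of outliers or noise, which can chain separate clusters together.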
Complete Link
The distance between two clusters is the maximum distance between any point in one cluster and any point in the other.
It is found by computing the distance for every cross-cluster pair of points and taking the largest.
More robust to noise and outliers than single link.
However, when clusters differ in size, the larger cluster may be broken into smaller pieces.
Group Average
The distance between two clusters is the average distance over all pairs of points, one taken from each cluster.
A trade-off between single link and complete link.
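The three linkage rules above differ only in how they summarise the cross-cluster pairwise distances. A minimal sketch, assuming Euclidean distance, with function names chosen for this example:

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """Distances for every pair with one point from each cluster."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_link(cluster_a, cluster_b):
    # Min: the closest cross-cluster pair.
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_link(cluster_a, cluster_b):
    # Max: the farthest cross-cluster pair.
    return max(pairwise_distances(cluster_a, cluster_b))

def group_average(cluster_a, cluster_b):
    # Average over all cross-cluster pairs.
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

cluster_a = [(0, 0), (1, 0)]
cluster_b = [(4, 0), (6, 0)]

print(single_link(cluster_a, cluster_b))    # 3.0
print(complete_link(cluster_a, cluster_b))  # 6.0
print(group_average(cluster_a, cluster_b))  # 4.5
```

The four cross-cluster distances here are 4, 6, 3, and 5, so the group average (4.5) falls between the single-link (3) and complete-link (6) values.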
Ward’s Method
The similarity of two clusters is based on the increase in the sum of squared errors (SSE) when the two clusters are merged; the pair with the smallest increase is merged first.
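The merge cost used by Ward's method can be sketched directly from that definition: the SSE of the merged cluster minus the SSEs of the two original clusters. The helper names (`centroid`, `sse`, `ward_cost`) and the sample clusters are assumptions made for this illustration.

```python
def centroid(cluster):
    """Mean of the points in a cluster, coordinate by coordinate."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)

def ward_cost(cluster_a, cluster_b):
    """Increase in total SSE caused by merging the two clusters."""
    return sse(cluster_a + cluster_b) - sse(cluster_a) - sse(cluster_b)

cluster_a = [(0, 0), (2, 0)]
cluster_b = [(10, 0), (12, 0)]
cluster_c = [(4, 0), (6, 0)]

print(ward_cost(cluster_a, cluster_b))  # 100.0
print(ward_cost(cluster_a, cluster_c))  # 16.0
```

Merging cluster_a with cluster_c raises the SSE far less than merging it with cluster_b, so under this rule cluster_a and cluster_c would be merged first.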