Clustering Flashcards
What is clustering?
Clustering is an unsupervised machine learning technique that groups similar data points together.
E.g., finding similar customers on an e-commerce platform.
What is Elbow plot?
A plot of WCSS (Within-Cluster Sum of Squares) vs. the number of clusters; the "elbow" where the curve flattens suggests the ideal number of clusters.
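A minimal sketch of the elbow method, assuming scikit-learn and a synthetic blob dataset (the dataset and parameter values are illustrative, not from the flashcards):

```python
# Compute WCSS (KMeans' inertia_) for a range of k values;
# the "elbow" is where the decrease in WCSS levels off.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squared distances

# Plotting wcss against range(1, 11) gives the elbow plot.
```

In practice the list would be plotted (e.g. with matplotlib) and the elbow read off visually.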
What is the disadvantage of KMeans?
- The number of clusters must be specified in advance
- Even with the elbow method, the elbow is sometimes not clear from the plot
- Sensitive to outliers (it is a distance-based algorithm)
- Performs poorly if the clusters are not spherical
Full form of DBSCAN
Density-Based Spatial Clustering of Applications with Noise
Epsilon
Radius of the neighbourhood circle around a point.
MinPts
Minimum number of points that must lie within Epsilon of a point for it to be called a core point.
Core point
A point that has at least MinPts points within its Epsilon radius.
Border point
A point that has fewer than MinPts points within Epsilon but has at least one core point in its neighbourhood.
Noise point
Points that are neither core nor border points.
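The three point types above can be recovered from a fitted model. A sketch assuming scikit-learn's DBSCAN on a synthetic two-moons dataset (eps and min_samples values are illustrative):

```python
# DBSCAN labels noise points as -1; core points are listed in
# core_sample_indices_; border points are the clustered non-core points.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                      # -1 marks noise points
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True
border_mask = (labels != -1) & ~core_mask
noise_mask = labels == -1
```

Every point falls into exactly one of the three masks.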
Advantages and disadvantages of DBSCAN
Advantages:
1. Robust to outliers
2. No need to specify no. of clusters
3. Can find arbitrary shaped clusters
Disadvantages:
1. Difficulty with varying density clusters
2. Cannot predict cluster for a new point
Types of Hierarchical Clustering
- Agglomerative Clustering (Bottom-up)
- Divisive Clustering (Top-down)
Proximity matrix
A square matrix that stores the distances between each pair of data points
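A quick sketch of building a proximity matrix, assuming SciPy and a tiny hand-made dataset:

```python
# pdist computes all pairwise distances; squareform turns the
# condensed result into the full square proximity matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
prox = squareform(pdist(X, metric="euclidean"))  # 3x3 symmetric matrix
# prox[i, j] is the Euclidean distance between points i and j
```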
Types of agglomerative clustering
- Min (Single link)
- Max (Complete link)
- Average
- Ward
Single Link
Defines the distance between two clusters as the minimum distance between any two points, one from each cluster.
All possible inter-cluster point pairs are considered.
Should not be used in the presence of outliers (prone to "chaining").
Complete Link
Defines the distance between two clusters as the maximum distance between any two points, one from each cluster.
All possible inter-cluster point pairs are considered.
More robust to outliers than single link.
However, with clusters of different sizes, the bigger cluster may break into smaller sections.
Group Average
Uses the average distance over every pair of points, one from each cluster.
A trade-off between single and complete link.
Ward’s Method
The dissimilarity of two clusters is the increase in total within-cluster squared error when the two clusters are merged.
How to find the ideal number of clusters in agglomerative clustering?
In the dendrogram plot, cut at the longest vertical line that no horizontal (merge) line crosses; the number of branches intersected by that cut is the number of clusters.
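A sketch of agglomerative clustering with Ward linkage, assuming SciPy and two well-separated synthetic blobs (the data and the choice of 2 clusters are illustrative):

```python
# linkage() records the merge history; passing Z to
# scipy.cluster.hierarchy.dendrogram() would draw the dendrogram.
# fcluster() cuts the tree into a chosen number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # blob around (0, 0)
               rng.normal(5, 0.3, (20, 2))])  # blob around (5, 5)

Z = linkage(X, method="ward")                    # Ward's method
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
```

Swapping `method` for `"single"`, `"complete"`, or `"average"` gives the other linkage types from the cards above.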
Advantages and disadvantages of Agglomerative
Advantages:
1. Widely applicable
2. Dendrogram
Disadvantages:
1. Result depends on the type of linkage and the choice of dissimilarity measure.
2. Computationally expensive for larger datasets due to pairwise distance calculation.
3. Even outliers are forced into a cluster.
How is PCA different from clustering?
PCA is a low-dimensional representation of data, whereas Clustering is categorizing the data into homogeneous subgroups.
Dendrogram
A tree-like visual representation of the observations and the order in which they were merged.
Can we cluster features on the basis of observations to discover subgroups among the features?
Yes
What is the criterion to determine a good cluster in K-Means?
WCSS: within-cluster sum of squared (Euclidean) distances; lower is better.
Centroid linkage
The dissimilarity between the centroids of two clusters.
Can result in inversions (a merge occurring at a lower height than an earlier merge).
Choice of dissimilarity measure
- Euclidean distance
- Correlation-based distance: two observations are similar if their feature profiles are highly correlated.
It focuses on the shape of the observations rather than their magnitudes.
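A sketch contrasting the two measures, assuming SciPy; the vectors are made-up profiles where one is a scaled copy of the other:

```python
# Correlation distance ignores magnitude and keeps only shape:
# a profile and its scaled copy are maximally similar under it,
# while their Euclidean distance is large.
import numpy as np
from scipy.spatial.distance import correlation, euclidean

a = np.array([1.0, 2.0, 3.0, 4.0])
b = 10 * a                   # same shape, much larger magnitude

eu = euclidean(a, b)         # large: magnitudes differ
co = correlation(a, b)       # ~0: profiles are perfectly correlated
```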
Should the observations first be standardized in any way before clustering?
It depends on the problem we are trying to solve (e.g., standardize when features are on very different scales).
How do different clustering algorithms handle a new point?
K-Means: Yes, assign it to the nearest centroid of the fitted clusters
DBSCAN: No, the algorithm needs to be re-run
Agglomerative Clustering: No, the algorithm needs to be re-run
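A sketch of the K-Means case, assuming scikit-learn and a synthetic blob dataset (the new point's coordinates are illustrative):

```python
# Only KMeans exposes predict(): a new point is assigned to the
# nearest learned centroid without refitting. DBSCAN and
# AgglomerativeClustering have no predict() method in scikit-learn,
# so they must be re-run with the new point included.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

new_point = np.array([[0.0, 0.0]])
label = km.predict(new_point)  # nearest-centroid assignment, no refit
```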