Clustering Flashcards

1
Q

What are the disadvantages of KMeans?

A
  1. The number of clusters (k) must be specified in advance.
  2. Even with the elbow method, the elbow is sometimes not clear from the plot.
  3. Sensitive to outliers (it is a distance-based algorithm).
  4. Performs poorly if the clusters are not spherical.
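A sketch of the elbow heuristic mentioned above, assuming scikit-learn and synthetic blob data (both illustrative, not part of the card):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: 3 well-separated spherical blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Elbow heuristic: fit KMeans for increasing k and record WCSS (inertia)
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Inertia decreases as k grows; the "elbow" is where the drop flattens
# (here, around k=3) - but, as the card notes, on real data the elbow
# is not always this clear.
```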
2
Q

Full form of DBSCAN

A

Density-Based Spatial Clustering of Applications with Noise

3
Q

Epsilon

A

The radius of the neighborhood (circle) drawn around each point.

4
Q

MinPts

A

Minimum number of points required within a point's Epsilon neighborhood for it to be called a core point.

5
Q

Core point

A

A point that has MinPts or more points within its Epsilon neighborhood.

6
Q

Border point

A

A point that has fewer than MinPts points within its Epsilon neighborhood but has at least one core point among them.

7
Q

Noise point

A

Points that are neither core points nor border points.

8
Q

Advantages and disadvantages of DBSCAN

A

Advantages:
1. Robust to outliers
2. No need to specify the number of clusters
3. Can find arbitrarily shaped clusters

Disadvantages:
1. Struggles with clusters of varying density
2. Cannot assign a cluster to a new point without re-running the algorithm
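A minimal DBSCAN sketch on synthetic half-moon data (illustrative; assumes scikit-learn), showing the arbitrary-shape advantage and the noise label:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-spherical clusters that KMeans handles poorly
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps is the Epsilon radius, min_samples is MinPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                                   # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```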

9
Q

Types of Hierarchical Clustering

A
  1. Agglomerative Clustering (Bottom-up)
  2. Divisive Clustering (Top-down)
10
Q

Proximity matrix

A

A square matrix that stores the distances between each pair of data points

11
Q

Types of agglomerative clustering

A
  1. Min (Single link)
  2. Max (Complete link)
  3. Average
  4. Ward
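The four linkage criteria above can be tried with SciPy's hierarchy module; a sketch on made-up, well-separated data (the data and cluster count are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated illustrative groups of 20 points each
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(5.0, 0.3, (20, 2))])

# The four linkage criteria from the card, in SciPy's naming
results = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                             # merge history
    results[method] = fcluster(Z, t=2, criterion="maxclust")  # cut into 2
```

On such clean data all four linkages agree; their differences (outlier chaining, cluster breaking) show up on noisier data.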
12
Q

Single Link

A

The distance between two clusters is the minimum distance over all pairs of points, one from each cluster.

Should not be used in the presence of outliers, since a single stray point can chain otherwise distant clusters together.

13
Q

Complete Link

A

The distance between two clusters is the maximum distance over all pairs of points, one from each cluster.

Robust to outliers.
However, with clusters of different sizes, the bigger cluster may break into smaller sections.

14
Q

Group Average

A

The distance between two clusters is the average distance over all pairs of points, one from each cluster.

A trade-off between single and complete link.

15
Q

Ward’s Method

A

The dissimilarity of two clusters is the increase in total within-cluster squared error when the two clusters are merged.

16
Q

How to find the ideal number of clusters in agglomerative clustering?

A

In the dendrogram, cut where the vertical lines are longest without being crossed by any horizontal line; the number of vertical lines that cut crosses gives the number of clusters.
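The "longest uncrossed vertical line" rule can be sketched numerically with SciPy: the merge heights in the linkage matrix are the dendrogram's vertical positions, and the largest gap between successive heights marks the cut (synthetic three-group data is assumed here for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(3)
# Three tight, well-separated synthetic groups of 15 points each
X = np.vstack([rng.normal((0, 0), 0.2, (15, 2)),
               rng.normal((4, 4), 0.2, (15, 2)),
               rng.normal((0, 4), 0.2, (15, 2))])

Z = linkage(X, method="ward")
heights = Z[:, 2]          # merge heights = vertical positions in the dendrogram
gaps = np.diff(heights)    # vertical gap between successive merges

# The largest gap is the longest stretch with no horizontal (merge) line;
# cutting inside it leaves len(heights) - argmax(gaps) clusters.
k = len(heights) - int(np.argmax(gaps))
```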

17
Q

Advantages and disadvantages of Agglomerative

A

Advantages:
1. Widely applicable
2. Produces a dendrogram, which helps in choosing the number of clusters

Disadvantages:
1. Results depend on the type of linkage and the choice of dissimilarity measure
2. Computationally expensive for larger datasets, due to pairwise distance calculation
3. Even outliers are forced into a cluster

18
Q

How is PCA different from clustering?

A

PCA looks for a low-dimensional representation of the data, whereas clustering looks for homogeneous subgroups among the observations.

19
Q

Dendrogram

A

A tree-like visual representation of the observations and the order in which they are merged into clusters.

20
Q

Can we cluster features on the basis of observations to discover subgroups among the features?

A

Yes

21
Q

What is the criterion for a good cluster in K-Means?

A

WCSS - within-cluster sum of squared (Euclidean) distances of each point to its cluster centroid.
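A sketch checking that scikit-learn's inertia_ attribute is exactly this WCSS (synthetic blob data assumed for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=1)
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# WCSS by hand: squared Euclidean distance from each point to its centroid,
# summed over all clusters
wcss = sum(np.sum((X[km.labels_ == c] - km.cluster_centers_[c]) ** 2)
           for c in range(3))

# scikit-learn exposes the same quantity as km.inertia_
```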

22
Q

Centroid linkage

A

Dissimilarity between the centroids of two clusters.

Can result in inversions (a later merge occurring at a lower height than an earlier one).

23
Q

Choice of dissimilarity measure

A
  1. Euclidean distance
  2. Correlation-based distance - two observations are similar if their features are highly correlated.
    It focuses on the shape of the observations rather than their magnitudes.
24
Q

Should the observations first be standardized in any way before clustering?

A

Depends on the problem we are trying to solve.

25
Q

Can the different clustering algorithms assign a cluster to a new point?

A

K-Means: Yes - the new point is assigned to the nearest learned centroid.
DBSCAN: No - the algorithm must be re-run.
Agglomerative Clustering: No - the algorithm must be re-run.
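A sketch of the K-Means case (assuming scikit-learn and synthetic data): predict() assigns a new point to the nearest learned centroid, which we can confirm by hand:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=120, centers=3, cluster_std=0.5, random_state=7)
km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)

# K-Means can label an unseen point via the nearest learned centroid ...
new_point = np.array([[0.0, 0.0]])
label = km.predict(new_point)[0]

# ... which matches the nearest-centroid assignment computed by hand.
nearest = int(np.argmin(np.sum((km.cluster_centers_ - new_point) ** 2, axis=1)))

# sklearn's DBSCAN and AgglomerativeClustering expose no predict();
# as the card says, they must be re-fit to place a new point.
```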