Clustering Flashcards

1
Q

What is clustering?

A

Clustering is an unsupervised machine learning technique that groups similar data points together.
E.g., grouping similar customers on an e-commerce platform.

2
Q

What is an elbow plot?

A

A plot of WCSS (within-cluster sum of squares) against the number of clusters; the bend ("elbow") where the curve flattens suggests a good number of clusters.
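As a minimal sketch (assuming scikit-learn is available; the dataset and parameters are made up for illustration), the elbow curve can be computed from KMeans's `inertia_`, which is exactly the WCSS:

```python
# Illustrative sketch (assumed: scikit-learn installed; toy blob data).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the WCSS for this k

# WCSS shrinks as k grows; plotting wcss against k and looking for the
# bend ("elbow") suggests a good number of clusters.
```

Plotting `wcss` against `range(1, 9)` gives the elbow plot itself.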

3
Q

What is the disadvantage of KMeans?

A
  1. The number of clusters must be specified in advance.
  2. Even when using the elbow method, sometimes the elbow is not clear from the plot.
  3. Sensitive to outliers (as it is a distance-based algorithm).
  4. Performs poorly if the clusters are not roughly spherical.
4
Q

Full form of DBSCAN

A

Density-Based Spatial Clustering of Applications with Noise

5
Q

Epsilon

A

The radius of the circle (neighborhood) drawn around each point to count its neighbors.

6
Q

MinPts

A

The minimum number of points within the epsilon circle for a point to be called a core point.

7
Q

Core point

A

A point that has at least MinPts points (including itself) within its epsilon neighborhood.

8
Q

Border point

A

A point that has fewer than MinPts points within its epsilon neighborhood, but lies within epsilon of at least one core point.

9
Q

Noise point

A

Points that are neither core points nor border points; DBSCAN treats them as outliers.
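The DBSCAN definitions above can be seen directly in a short sketch (assuming scikit-learn; the `eps` and `min_samples` values are illustrative choices for this toy dataset). Noise points receive label -1, and core points are reported in `core_sample_indices_`:

```python
# Illustrative sketch (assumed: scikit-learn; eps/min_samples chosen for toy data).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)  # eps = epsilon, min_samples = MinPts
labels = db.labels_                          # -1 marks noise points

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True    # indices of core points
n_noise = int((labels == -1).sum())
n_border = int(((labels != -1) & ~core_mask).sum())  # clustered but not core
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Every point falls into exactly one of the three categories: core, border, or noise.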

10
Q

Advantages and disadvantages of DBSCAN

A

Advantages:
1. Robust to outliers
2. No need to specify the number of clusters
3. Can find arbitrarily shaped clusters

Disadvantages:
1. Struggles with clusters of varying density
2. Cannot predict the cluster of a new point

11
Q

Types of Hierarchical Clustering

A
  1. Agglomerative Clustering (Bottom-up)
  2. Divisive Clustering (Top-down)
12
Q

Proximity matrix

A

A square matrix that stores the distances between each pair of data points
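A proximity matrix can be built in a couple of lines (assuming SciPy; the sample points are made up): `pdist` returns the condensed pairwise distances and `squareform` reshapes them into the square matrix.

```python
# Illustrative sketch (assumed: SciPy). pdist gives condensed pairwise
# distances; squareform turns them into the square proximity matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = squareform(pdist(X, metric="euclidean"))
# D is symmetric with zeros on the diagonal; D[0, 1] is the 3-4-5 distance, 5.0
```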

13
Q

Types of agglomerative clustering

A
  1. Min (Single link)
  2. Max (Complete link)
  3. Average
  4. Ward
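These four options map directly onto scikit-learn's `linkage` parameter (a sketch, assuming scikit-learn; the toy data is illustrative):

```python
# Illustrative sketch (assumed: scikit-learn, whose AgglomerativeClustering
# exposes exactly these four linkage options).
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

labelings = {}
for linkage in ("single", "complete", "average", "ward"):
    labelings[linkage] = AgglomerativeClustering(
        n_clusters=3, linkage=linkage
    ).fit_predict(X)
```

Different linkages can produce different clusterings of the same data, which is the dependence on linkage choice noted later in this deck.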
14
Q

Single Link

A

Uses the minimum distance between any pair of points, one from each cluster, found by considering all possible pairs across the two clusters.

Should not be used in the presence of outliers, as noise can "chain" separate clusters together.

15
Q

Complete Link

A

Uses the maximum distance between any pair of points, one from each cluster, found by considering all possible pairs across the two clusters.

Robust to outliers.
However, with clusters of different sizes, the larger cluster may break into smaller sections.

16
Q

Group Average

A

Uses the average distance over all pairs of points, one from each cluster.

A trade-off between single and complete link.

17
Q

Ward’s Method

A

The similarity of two clusters is based on the increase in squared error when the two clusters are merged; the pair of clusters whose merge increases it least is merged first.

18
Q

How do you find the ideal number of clusters in agglomerative clustering?

A

In the dendrogram, cut where the vertical line uncrossed by any horizontal line is the longest; the number of vertical lines intersected by that cut is the number of clusters.
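The longest uncut vertical stretch in a dendrogram corresponds to the largest gap between successive merge heights in the linkage matrix, which can be read off numerically (a sketch, assuming SciPy and scikit-learn; the toy data is illustrative):

```python
# Illustrative sketch (assumed: SciPy + scikit-learn; toy blobs). The biggest
# jump between successive merge heights plays the role of the longest
# vertical line in the dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=1)

Z = linkage(X, method="ward")
heights = Z[:, 2]                 # merge heights, in merge order
j = np.diff(heights).argmax()     # biggest jump between successive merges
cut = (heights[j] + heights[j + 1]) / 2
labels = fcluster(Z, t=cut, criterion="distance")
n_clusters = len(np.unique(labels))
```

`scipy.cluster.hierarchy.dendrogram(Z)` draws the same structure for visual inspection.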

19
Q

Advantages and disadvantages of Agglomerative

A

Advantages:
1. Widely applicable
2. Produces a dendrogram, which visualizes the full merge hierarchy

Disadvantages:
1. Results depend on the type of linkage and the choice of dissimilarity measure.
2. Computationally expensive for large datasets, due to pairwise distance calculations.
3. Even outliers are forced into a cluster.

20
Q

How is PCA different from clustering?

A

PCA finds a low-dimensional representation of the data, whereas clustering partitions the data into homogeneous subgroups.

21
Q

Dendrogram

A

A tree-like visual representation of the observations and the order in which clusters are merged.

22
Q

Can we cluster the features on the basis of the observations, in order to discover subgroups among the features?

A

Yes. Transposing the data matrix and clustering its rows makes the same algorithms group features instead of observations.

23
Q

What is the criterion for a good cluster in K-Means?

A

WCSS: the within-cluster sum of squared (Euclidean) distances; lower WCSS means tighter clusters.

24
Q

Centroid linkage

A

The dissimilarity between the centroids of two clusters.

Can result in inversions (a merge at a lower height than an earlier merge), making the dendrogram hard to interpret.

25
Q

Choice of dissimilarity measure

A
  1. Euclidean distance
  2. Correlation-based distance: two observations are similar if they are highly correlated.
    It focuses on the shape of the observations rather than their magnitudes.
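The shape-versus-magnitude distinction can be made concrete (a small sketch with made-up vectors, assuming NumPy):

```python
# Illustrative sketch: correlation-based distance treats two observations as
# close when their profiles have the same shape, regardless of magnitude.
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = a * 10 + 5        # same shape as a, much larger magnitude
c = a[::-1].copy()    # opposite shape

def corr_distance(u, v):
    # 1 - Pearson correlation: 0 for identical shape, 2 for opposite shape
    return 1.0 - np.corrcoef(u, v)[0, 1]

# Euclidean distance puts a and b far apart; correlation distance puts them together.
```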
26
Q

Should the observations first be standardized in any way before clustering?

A

It depends on the problem we are trying to solve; if the features are measured on very different scales, standardizing them first is usually advisable so no single feature dominates the distance.
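As a sketch of why scale matters (assuming scikit-learn; the two-feature data is made up), standardization puts every column on the same footing before distances are computed:

```python
# Illustrative sketch (assumed: scikit-learn). When features sit on very
# different scales, the large-scale feature dominates Euclidean distance
# unless the columns are standardized first.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1000.0, 1.0], [2000.0, 2.0], [3000.0, 3.0]])
X_std = StandardScaler().fit_transform(X)
# Each column now has mean 0 and unit variance.
```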

27
Q

For a new point, can each clustering algorithm assign it to an existing cluster?

A

K-Means: Yes, by assigning the point to the nearest centroid of the fitted clusters.
DBSCAN: No, the algorithm must be re-run.
Agglomerative Clustering: No, the algorithm must be re-run.
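In scikit-learn terms (a sketch with toy data), only the fitted K-Means model carries enough state, the centroids, to label a new point via `.predict()`; `DBSCAN` and `AgglomerativeClustering` offer no such method:

```python
# Illustrative sketch (assumed: scikit-learn). KMeans.predict() assigns a
# new point to the nearest learned centroid.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=7)
km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)

new_point = X[:1] + 0.01            # a hypothetical new observation
label = int(km.predict(new_point)[0])  # nearest-centroid assignment
```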