Clustering Flashcards

1
Q

What are the disadvantages of KMeans?

A
  1. The number of clusters (k) must be specified in advance.
  2. Even with the elbow method, the elbow is sometimes not clear from the plot.
  3. Sensitive to outliers (it is a distance-based algorithm).
  4. Performs poorly if the clusters are not spherical.
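A sketch of the elbow heuristic mentioned above, assuming scikit-learn and synthetic blob data (both illustrative, not part of the card):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: 3 well-separated spherical blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Elbow heuristic: fit KMeans for increasing k and record WCSS (inertia)
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Inertia decreases as k grows; the "elbow" is where the drop flattens
# (here, around k=3) - but, as the card notes, on real data the elbow
# is not always this clear.
```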
2
Q

Full form of DBSCAN

A

Density-Based Spatial Clustering of Applications with Noise

3
Q

Epsilon

A

The radius of the neighborhood (circle) drawn around each point.

4
Q

MinPts

A

Minimum number of points required within a point's Epsilon neighborhood for it to be called a core point.

5
Q

Core point

A

A point that has MinPts or more points within its Epsilon neighborhood.

6
Q

Border point

A

A point that has fewer than MinPts points within its Epsilon neighborhood but has at least one core point among them.

7
Q

Noise point

A

Points that are neither core points nor border points.

8
Q

Advantages and disadvantages of DBSCAN

A

Advantages:
1. Robust to outliers
2. No need to specify the number of clusters
3. Can find arbitrarily shaped clusters

Disadvantages:
1. Struggles with clusters of varying density
2. Cannot assign a cluster to a new point without re-running the algorithm
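A minimal DBSCAN sketch on synthetic half-moon data (illustrative; assumes scikit-learn), showing the arbitrary-shape advantage and the noise label:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-spherical clusters that KMeans handles poorly
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps is the Epsilon radius, min_samples is MinPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                                   # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```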

9
Q

Types of Hierarchical Clustering

A
  1. Agglomerative Clustering (Bottom-up)
  2. Divisive Clustering (Top-down)
10
Q

Proximity matrix

A

A square matrix that stores the distances between each pair of data points

11
Q

Types of agglomerative clustering

A
  1. Min (Single link)
  2. Max (Complete link)
  3. Average
  4. Ward
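The four linkage criteria above can be tried with SciPy's hierarchy module; a sketch on made-up, well-separated data (the data and cluster count are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated illustrative groups of 20 points each
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(5.0, 0.3, (20, 2))])

# The four linkage criteria from the card, in SciPy's naming
results = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                             # merge history
    results[method] = fcluster(Z, t=2, criterion="maxclust")  # cut into 2
```

On such clean data all four linkages agree; their differences (outlier chaining, cluster breaking) show up on noisier data.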
12
Q

Single Link

A

The distance between two clusters is the minimum distance over all pairs of points, one from each cluster.

Should not be used in the presence of outliers, since a single stray point can chain otherwise distant clusters together.

13
Q

Complete Link

A

The distance between two clusters is the maximum distance over all pairs of points, one from each cluster.

Robust to outliers.
However, with clusters of different sizes, the bigger cluster may break into smaller sections.

14
Q

Group Average

A

The distance between two clusters is the average distance over all pairs of points, one from each cluster.

A trade-off between single and complete link.

15
Q

Ward’s Method

A

The dissimilarity of two clusters is the increase in total within-cluster squared error when the two clusters are merged.

16
Q

How to find the ideal number of clusters in agglomerative clustering?

A

In the dendrogram, cut where the vertical lines are longest without being crossed by any horizontal line; the number of vertical lines that cut crosses gives the number of clusters.
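The "longest uncrossed vertical line" rule can be sketched numerically with SciPy: the merge heights in the linkage matrix are the dendrogram's vertical positions, and the largest gap between successive heights marks the cut (synthetic three-group data is assumed here for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(3)
# Three tight, well-separated synthetic groups of 15 points each
X = np.vstack([rng.normal((0, 0), 0.2, (15, 2)),
               rng.normal((4, 4), 0.2, (15, 2)),
               rng.normal((0, 4), 0.2, (15, 2))])

Z = linkage(X, method="ward")
heights = Z[:, 2]          # merge heights = vertical positions in the dendrogram
gaps = np.diff(heights)    # vertical gap between successive merges

# The largest gap is the longest stretch with no horizontal (merge) line;
# cutting inside it leaves len(heights) - argmax(gaps) clusters.
k = len(heights) - int(np.argmax(gaps))
```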

17
Q

Advantages and disadvantages of Agglomerative

A

Advantages:
1. Widely applicable
2. Produces a dendrogram, which helps in choosing the number of clusters

Disadvantages:
1. Results depend on the type of linkage and the choice of dissimilarity measure
2. Computationally expensive for larger datasets, due to pairwise distance calculation
3. Even outliers are forced into a cluster

18
Q

How is PCA different from clustering?

A

PCA looks for a low-dimensional representation of the data, whereas clustering looks for homogeneous subgroups among the observations.

19
Q

Dendrogram

A

A tree-like visual representation of the observations and the order in which they are merged into clusters.

20
Q

Can we cluster features on the basis of observations to discover subgroups among the features?

A

Yes

21
Q

What is the criterion for a good cluster in K-Means?

A

WCSS - within-cluster sum of squared (Euclidean) distances of each point to its cluster centroid.
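A sketch checking that scikit-learn's inertia_ attribute is exactly this WCSS (synthetic blob data assumed for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=1)
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# WCSS by hand: squared Euclidean distance from each point to its centroid,
# summed over all clusters
wcss = sum(np.sum((X[km.labels_ == c] - km.cluster_centers_[c]) ** 2)
           for c in range(3))

# scikit-learn exposes the same quantity as km.inertia_
```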

22
Q

Centroid linkage

A

Dissimilarity between the centroids of two clusters.

Can result in inversions (a later merge occurring at a lower height than an earlier one).

23
Q

Choice of dissimilarity measure

A
  1. Euclidean distance
  2. Correlation-based distance - two observations are similar if their features are highly correlated.
    It focuses on the shape of the observations rather than their magnitudes.
24
Q

Should the observations first be standardized in any way before clustering?

A

Depends on the problem we are trying to solve.

25
Q

Can the different clustering algorithms assign a cluster to a new point?

A

K-Means: Yes - the new point is assigned to the nearest learned centroid.
DBSCAN: No - the algorithm must be re-run.
Agglomerative Clustering: No - the algorithm must be re-run.
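A sketch of the K-Means case (assuming scikit-learn and synthetic data): predict() assigns a new point to the nearest learned centroid, which we can confirm by hand:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=120, centers=3, cluster_std=0.5, random_state=7)
km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)

# K-Means can label an unseen point via the nearest learned centroid ...
new_point = np.array([[0.0, 0.0]])
label = km.predict(new_point)[0]

# ... which matches the nearest-centroid assignment computed by hand.
nearest = int(np.argmin(np.sum((km.cluster_centers_ - new_point) ** 2, axis=1)))

# sklearn's DBSCAN and AgglomerativeClustering expose no predict();
# as the card says, they must be re-fit to place a new point.
```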