L13 - Hierarchical Structuring and DBSCAN Flashcards
What is hierarchical clustering?
- An unsupervised learning method for grouping similar data points into clusters.
- The goal is to represent degrees of similarity within the data.
- Enables the data to be partitioned at various levels of granularity.
What is the structure of hierarchical clustering?
- Clusters and sub-clusters.
- Results in tree-like structure.
What are Dendrograms?
Tree diagrams that show a hierarchy of clusters, where each node represents a cluster.
In Dendrograms, what are leaves called?
Singletons
What are the 2 approaches to creating a Dendrogram?
- Bottom up -> Agglomerative
- Top down -> Divisive
How is a Dendrogram created using the Agglomerative method?
- Start with each individual data point as its own cluster
- Iteratively:
- Compute the distance matrix between clusters
- Merge the two closest clusters (see the sketch below)
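A minimal sketch of the agglomerative procedure using SciPy's hierarchical clustering; the toy two-blob data and the cut height are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.5, (10, 2)),   # one blob around (0, 0)
                    rng.normal(5, 0.5, (10, 2))])  # another around (5, 5)

# Each iteration merges the two closest clusters until one remains.
Z = linkage(points, method="single")

# Cut the dendrogram at a chosen height to recover flat cluster labels.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)
```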
How is a Dendrogram created using the Divisive method?
- Start with all data points in one large cluster
- Iteratively split the clusters into smaller ones (see the sketch below)
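SciPy does not ship a divisive routine, so this is a hedged sketch of one common approximation of the top-down idea: repeatedly bisect the largest remaining cluster with 2-means. The `divisive` helper, the data, and the stopping rule are all illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(points, max_clusters=4):
    clusters = [points]
    while len(clusters) < max_clusters:
        # Split the largest remaining cluster into two.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        target = clusters.pop(idx)
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(target)
        clusters.append(target[km.labels_ == 0])
        clusters.append(target[km.labels_ == 1])
    return clusters

rng = np.random.default_rng(0)
data = rng.normal(0, 1, (40, 2))
print([len(c) for c in divisive(data)])
```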
What is the time complexity of Hierarchical clustering?
O(N^3)
When performing hierarchical clustering, what is the only thing we need to be able to compute?
A distance matrix between the points (see the sketch below)
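A minimal sketch of building that distance matrix with SciPy; the toy points are an illustrative assumption:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
condensed = pdist(points, metric="euclidean")  # N*(N-1)/2 pairwise distances
matrix = squareform(condensed)                 # expanded to a full N x N matrix
print(matrix)
```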
Why can’t we use brute force to calculate the Distance Matrix?
With many points, the number of pairwise distances (and the cost of recomputing them at every merge step) makes the complexity too high.
What are the 5 methods we can use to calculate distance between clusters? (See the sketch after this list.)
- Single Linkage -> Distance defined as the distance between the 2 closest data points of 2 separate clusters.
- Complete Linkage -> Distance defined as the distance between the 2 furthest data points of 2 separate clusters.
- Average Linkage -> Distance defined as the average distance between all pairs of members of the 2 clusters.
- Centroid Linkage -> Distance defined as the distance between the centroids of the 2 clusters.
- Ward's Method -> Join the pair of clusters whose merge minimises the increase in total distance from the centroids.
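A minimal sketch of how these five criteria map onto the `method` argument of SciPy's `linkage`; the random data is an illustrative assumption (centroid and Ward linkage assume Euclidean distances on raw observations):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.random.default_rng(0).normal(size=(20, 2))

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(points, method=method)
    # Z[-1, 2] is the distance at which the last two clusters merge.
    print(f"{method:>8}: final merge distance = {Z[-1, 2]:.3f}")
```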
What is an issue with each of the 5 distance calculation methods?
- Single Linkage -> Can lead to a long chain of clusters
- Complete Linkage -> Often breaks large clusters into 2 or more
- Average Linkage -> High computational cost, since every pair of data points across the 2 clusters must be compared
- Centroid Linkage -> Biased towards spherical clusters
- Ward's Method -> Biased towards spherical clusters
What method do we use to know if we have a good cluster count?
The Elbow method (see the sketch below)
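A minimal sketch of the elbow method using k-means inertia (the total within-cluster sum of squares); the three-blob data is an illustrative assumption. Plotting inertia against k, the curve bends ("elbows") at the good cluster count:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 5, 10)])

for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
    print(f"k={k}: inertia={inertia:.1f}")  # the drop flattens after k=3
```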
What is DBSCAN?
A density-based clustering algorithm. Rather than comparing every point to every other (as in hierarchical clustering), it places a density threshold around each point; if the threshold area contains enough other data points, the point is classed as a Dense Point.
What are the 2 hyperparameters of DBSCAN?
Epsilon -> Radius of the density threshold
MinPts -> Minimum number of points needed within the threshold for a point to be considered a dense point (see the sketch below)
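A minimal sketch of running DBSCAN with scikit-learn, where MinPts is called `min_samples`; the data and the parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (30, 2)),
                  rng.normal(4, 0.3, (30, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(data)
print(db.labels_)  # cluster ids per point; -1 marks noise
```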
How are points categorised in DBSCAN?
Core point -> Has at least MinPts points within its Epsilon radius. These points are in the interior of a cluster.
Border point -> Has fewer than MinPts points within its Epsilon radius, but lies in the neighbourhood of a core point.
Noise point -> Any point that is not a core or border point (see the sketch below).
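A hedged sketch of recovering the three categories from a fitted scikit-learn DBSCAN model; the data is an illustrative assumption. Core points are exposed via `core_sample_indices_`, noise points receive the label -1, and the rest are border points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (30, 2)),   # a dense blob
                  rng.uniform(-2, 6, (5, 2))])   # scattered outliers

db = DBSCAN(eps=0.5, min_samples=5).fit(data)

core_mask = np.zeros(len(data), dtype=bool)
core_mask[db.core_sample_indices_] = True   # >= MinPts neighbours within eps
noise_mask = db.labels_ == -1               # reachable from no core point
border_mask = ~core_mask & ~noise_mask      # near a core point, not core itself

print(f"core={core_mask.sum()}, border={border_mask.sum()}, noise={noise_mask.sum()}")
```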
What is the result of DBSCAN?
All points within a cluster can reach one another through steps of size at most Epsilon.
What are the pros and cons of DBSCAN?
Pros -> Resistant to noise. Can handle clusters of different shapes and sizes.
Cons -> Eps and MinPts interact and can be hard to specify.
What are 2 limitations of DBSCAN?
- Struggles to cluster data with varying densities
- Highly dependent on the selection of Eps and MinPts