2: Clustering Flashcards

1
Q

What is meant by clustering?

A

Clustering means gathering data records into natural groups (i.e., clusters) of similar samples according to predefined similarity/dissimilarity metrics, thereby extracting useful information about the given dataset.

2
Q

What is high inter-cluster separation?

A

The contents of different clusters should be very different from one another, i.e., the clusters are well separated.

3
Q

What is the similarity/dissimilarity metric used in UL?

A

A form of distance function between each pair of data records.

4
Q

Distance measure and combining A & B.

A

The distance measures how close A and B are to each other, and a decision is made whether to combine A and B in one cluster.

5
Q

What are the forms of distance functions?

A

Euclidean and Manhattan.

6
Q

How are the Euclidean and Manhattan distances calculated?

A

For two-dimensional datasets (i.e., having two features), the Euclidean distance between A = (x1, y1) and B = (x2, y2) is √((x1 − x2)² + (y1 − y2)²), and the Manhattan distance is |x1 − x2| + |y1 − y2|.
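
As a concrete sketch in Python (only the standard library; the points are made up for illustration):

import math

def euclidean(a, b):
    # Straight-line distance: sqrt((x1 - x2)^2 + (y1 - y2)^2)
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def manhattan(a, b):
    # City-block distance: |x1 - x2| + |y1 - y2|
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

print(euclidean((1, 2), (4, 6)))  # 5.0
print(manhattan((1, 2), (4, 6)))  # 7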

7
Q

What are the applications of Clustering?

A

Pattern recognition, image processing, spatial data analysis, bioinformatics, crime analysis, medical imaging, climatology, robotics, market segmentation, and recommendation systems (e.g., Spotify).

8
Q

What are the main classes of methods and techniques in UL?

A

Centroid-based clustering methods
Gaussian mixture model clustering methods
Hierarchical clustering methods
Density-based clustering methods

9
Q

What is centroid-based clustering (K-Means)?

A

Centroid-based clustering searches for a pre-determined number of clusters within an unlabeled, possibly multidimensional dataset. The rule is that the distance between a data record and each cluster's centroid is calculated, and the data record is assigned to the cluster achieving the minimum distance.
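
A minimal usage sketch with scikit-learn (an assumed library choice; the toy data below is illustrative, not from the card):

import numpy as np
from sklearn.cluster import KMeans

# Toy two-dimensional dataset (illustrative values only)
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# The number of clusters is pre-determined, as the method requires
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment per data record
print(kmeans.cluster_centers_)  # final centroids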

10
Q

What steps does K-means consist of?

A

Initialization (choose the number of clusters and random initial centroids), assignment (clusters are formed by connecting each data record to its nearest centroid), and update (centroids are recomputed; the assignment and update steps are repeated until convergence).

11
Q

K-Means + & -?

A

+ Simple to employ.
+ Can be used for large datasets smoothly.
+ Adjusts to the outliers.
- It is significantly difficult to predict the number of clusters.
- It assumes clusters are round, so it does not work well for data whose groups have different shapes.
- Evaluating distances becomes exceedingly less informative in high-dimensional spaces.

12
Q

Explain k-means clustering.

A

K-means clustering is a simple and elegant algorithm to partition a dataset into k distinct, non-overlapping clusters.

Choose a number of clusters k.
Randomly assign a number between 1 and k to each observation; these serve as initial cluster assignments.
For each of the k clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
Assign each observation to the cluster whose centroid is closest (where closest is defined using a distance metric).
Iterate until the cluster assignments stop changing.
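
A compact from-scratch sketch of these steps in numpy (a hypothetical helper; it initializes centroids directly rather than via random labels, which is a common variant, and assumes no cluster ever becomes empty):

import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct records as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Assignment: attach each record to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # assignments stopped changing
            return labels, centroids
        centroids = new_centroids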

13
Q

What is a Gaussian mixture model?

A

In Gaussian mixture model (GMM) clustering, each cluster is considered a probabilistic generative model. A data record has a probability of belonging to each cluster, and it is assigned to the cluster returning the highest probability. As in the k-means method, GMM also initially assumes the number of clusters for the input dataset. GMM tries to fit mixtures of Gaussian distributions to the dataset, where each distribution defines one cluster.
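
A minimal sketch with scikit-learn's GaussianMixture (an assumed library choice; the data is illustrative). Its fit() runs the expectation-maximization procedure described in the later cards:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.1],
              [8.0, 8.0], [8.3, 7.9], [7.7, 8.2]])

# The number of clusters is assumed up front, as in k-means
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X))        # hard assignment: cluster with highest probability
print(gmm.predict_proba(X))  # soft assignment: probability per cluster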

14
Q

What is the limitation of the GMM?

A

The algorithm can be slow because it finds the probability distribution for each cluster. It can also get stuck in a local maximum of the log-likelihood function. The steps, in general, suffer from heavy computations of the conditional probabilities.

15
Q

What are the two parameters that each cluster in GMM follows?

A

Cluster mean and standard deviation.

16
Q

What is the task in GMM?

A

The task is to find the optimum values for both the mean and the standard deviation, which maximize the likelihood of the data records belonging to their assigned clusters.

17
Q

What is used to achieve the task in GMM?

A

To achieve the task, the expectation-maximization (EM) algorithm is implemented.

18
Q

What does EM algorithm do?

A

It is an iterative method to find the maximum likelihood of the parameters (mean, standard deviation) to get the "best fit" model for the input dataset.

19
Q

What types of hierarchical clustering are there?

A

Agglomerative (from single to universe) and divisive (the reverse, from universe to single).

20
Q

What is a dendrogram?

A

A tool to show how the clusters progressed from the “single-point” clusters to the “universe” cluster, and it also aids the decision for the optimum number of clusters in the given dataset.

21
Q

How to find out how many clusters should one choose in hierarchical clustering?

A

To find the best number of clusters in a dendrogram:

Find the longest vertical line that doesn't touch any other lines.
Draw a horizontal line through this vertical line.
The number of vertical lines the horizontal line crosses is the number of clusters (a plotting sketch follows below).
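
A minimal plotting sketch with scipy and matplotlib (assumed library choices; the data is illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Merge history from the "single-point" clusters to the "universe" cluster
Z = linkage(X, method='ward')
dendrogram(Z)
plt.show()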

22
Q

Please explain the difference(s) between agglomerative and divisive hierarchical clustering.

A

Agglomerative hierarchical clustering works according to the bottom-up approach. In the agglomerative hierarchical method, at first, each object creates its own cluster. The single-element clusters are successively merged to make larger clusters, and the process of merging continues until all the singular clusters are merged into one big cluster that consists of all the objects.

Divisive hierarchical clustering works according to the top-down approach. In this method, all the objects are arranged within a big all-encompassing cluster, and the large cluster is continuously divided into smaller clusters until each cluster holds only a single object.

23
Q

Biggest benefit of hierarchical clusters?

A

The most important value of hierarchical clustering is that it makes no assumption about a particular number of clusters at the beginning of the algorithm.

24
Q

What are the downsides to hierarchical clustering?

A

Once a decision is made to combine two clusters, it cannot be undone; it is not possible to revisit a previous step. Hierarchical clustering is also not suitable for large datasets and is very sensitive to outliers within the data.

25
Q

What is Density-Based clustering about?

A

Density-based clustering approaches work to identify clusters by grouping "dense" data records together, which permits the representation of arbitrarily shaped clusters, as well as the detection of outliers within the data.

26
Q

What is the “dense” estimation based on?

A

The estimation of “dense” clusters is based on the ɛ-neighborhood concept.
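
A minimal sketch of the idea, assuming Euclidean distance and numpy (the helper name is hypothetical):

import numpy as np

def eps_neighborhood(X, i, eps):
    # Indices of all records (including record i itself) within distance eps of record i
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.where(dists <= eps)[0]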

27
Q

What are the benefits of DBSCAN?

A

An important advantage of DBSCAN is that it does not require us to estimate the number of clusters in the beginning of the algorithm. DBSCAN infers the number of clusters according to the underlying dataset, and it discovers arbitrarily shaped clusters. This technique efficiently separates clusters of high density versus clusters of low density within a given dataset, while pragmatically handling the contained outliers.

28
Q

What are downsides of DBSCAN?

A

DBSCAN struggles with datasets that have clusters of different densities because the same parameters (ɛ and MinPts) apply to all clusters. If the dataset isn’t well understood, choosing these parameters can be difficult and may require multiple trial-and-error adjustments.

29
Q

With respect to DBSCAN algorithm, please give the definitions for (a) core point, (b) border point, and (c) noise point (i.e., outlier).

A

A point is a core point if it has more than a specified number of points (MinPts) within a radius Eps; these are points in the interior of a cluster.

A border point has fewer than MinPts within Eps, but lies in the neighborhood of a core point.

A noise point is any point that is neither a core point nor a border point.
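
A minimal scikit-learn sketch (an assumed library choice; the data is illustrative): records labeled -1 are noise points, the indices in core_sample_indices_ are core points, and the remaining clustered records are border points:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)               # -1 marks noise points (outliers)
print(db.core_sample_indices_)  # indices of core points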

30
Q

Each clustering algorithm has its strengths and weaknesses. Please compare K-Means and DBSCAN in this regard.

A

K-Means requires setting the number of clusters in advance, whereas in DBSCAN the number of clusters emerges as an artefact of the method. DBSCAN, on the other hand, is based on the two parameters ɛ and MinPts, which have to be set beforehand and typically require a more in-depth knowledge of the data distribution. DBSCAN can infer clusters of arbitrary shape, but ideally the density of data points should be roughly the same across different groups.

31
Q

Please define DBSCAN.

A

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a density-based clustering method that converts regions with high object density into clusters with arbitrary shapes and sizes. DBSCAN defines a cluster as a maximal set of density connected points.