Clustering Flashcards
what is clustering
It is about partitioning the data points into groups such that the points within a group share some kind of similarity
what are the types of clustering
Representative-based algorithms, e.g. k-means
Hierarchical clustering algorithms, where we build a hierarchy (a dendrogram, for instance): we start from individual data points and merge those that are most similar, then at the next level look at these clusters and perform the same merging again
probabilistic, model-based algorithms: soft clustering
Explain the representative-based algorithm
Usually the data and k are given, where k is the number of clusters.
- representatives are initially generated using probabilistic sampling
- iteratively:
generate the clusters by assigning each point to its nearest representative using the distance function - perform an optimization step, where each representative is set to its cluster centroid
until convergence (a sketch follows below)
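A minimal sketch of this representative-based loop (a plain k-means in NumPy; the function and parameter names are illustrative, not from the lecture):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # representatives are initially sampled from the data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assign: each point joins the cluster of its nearest representative
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # optimize: each representative is moved to its cluster centroid
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):   # convergence
            break
        centers = new_centers
    return labels, centers
```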
what is the time complexity of K-means
The assign step: O(k·n) per iteration, where n is the number of data points (each point is compared against the k representatives)
the optimize step: O(n) per iteration, since the centroids are computed in one pass over the points
what is the disadvantage of K-means
the algorithm might be biased, especially when there are outliers (they pull the centroids away)
K-means requires a distance function and a function to compute averages
it works well for spherical clusters; however, for arbitrarily shaped clusters it does not work very well
what is the key difference between k-means and k-medoids
the main difference is simply that there is no need for a function that computes averages: the representatives (medoids) are actual data points
what are the key issues with medoids
finding k and choosing good representatives (medoids)
why would we use grid-based density clustering
because other clustering algorithms assume sphere-shaped clusters, whereas density/grid-based methods can discover clusters of arbitrary shape
what is the main idea behind density-based clustering
1- detect dense areas
2- grow and merge them
explain briefly how grid algorithms work
the idea is:
to discretize each dimension of the data into p intervals; note that the number of grid cells is p^d, where d is the number of dimensions
then a density threshold τ is used to define the dense hypercubes (cells with at least τ points), which are grown and merged into clusters (see the sketch below)
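A rough 2-D sketch of this grid idea (the function name and the default values of p and τ are assumptions): discretize, keep the cells with at least τ points, then merge adjacent dense cells.

```python
import numpy as np
from collections import defaultdict

def grid_clusters(X, p=10, tau=5):
    """Grid-based density clustering sketch for 2-D data X of shape (n, 2)."""
    # discretize each dimension into p intervals -> p**d cells (d = 2 here)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    cells = np.minimum(((X - mins) / (maxs - mins + 1e-12) * p).astype(int), p - 1)
    counts = defaultdict(int)
    for c in map(tuple, cells):
        counts[c] += 1
    dense = {c for c, n in counts.items() if n >= tau}      # cells above the threshold τ
    # grow/merge: connect dense cells that touch each other (flood fill over the grid)
    cluster_of, next_id = {}, 0
    for cell in dense:
        if cell in cluster_of:
            continue
        stack = [cell]
        while stack:
            cur = stack.pop()
            if cur in cluster_of:
                continue
            cluster_of[cur] = next_id
            stack.extend((cur[0] + dx, cur[1] + dy)
                         for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                         if (cur[0] + dx, cur[1] + dy) in dense)
        next_id += 1
    # points falling in non-dense cells are treated as noise (label -1)
    return np.array([cluster_of.get(tuple(c), -1) for c in cells])
```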
how to choose p and τ
● When p is too small?
○ points from multiple clusters will be present in the same grid region => undesirable merging of clusters
● When p is too large?
○ many empty grid cells => natural clusters in the data may be disconnected
● When τ is too low?
○ clusters, together with the ambient noise, will be merged
● When τ is too high?
○ we can partially or entirely miss a cluster
what are the main types of points that DBSCAN is able to distinguish
core points: those that have at least τ points in their neighbourhood
border (neighbour) points: those that fall within the neighbourhood of some core point but are not core themselves
noise points: those that are neither core nor neighbour points
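As a quick illustration (the toy data and parameter values are made up), scikit-learn's DBSCAN exposes exactly these two ingredients as eps (the neighbourhood radius) and min_samples (the τ threshold):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)                        # toy data, purely illustrative
db = DBSCAN(eps=0.1, min_samples=5).fit(X)

labels = db.labels_                                # -1 marks noise points
core_mask = np.zeros_like(labels, dtype=bool)
core_mask[db.core_sample_indices_] = True          # core points
border_mask = (labels != -1) & ~core_mask          # clustered but not core
```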
what is progressive DBSCAN
the idea consists of clustering first, then relaxing τ (the density threshold) and applying the algorithm again; the previously clustered data is omitted, so that we capture the newly available, less dense clusters
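A possible sketch of this progressive loop (my own reading of the description; here "relaxing" means increasing the radius eps between rounds, and all names and values are assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def progressive_dbscan(X, eps_schedule=(0.05, 0.1, 0.2), min_samples=5):
    """Run DBSCAN repeatedly with a relaxed radius, omitting already clustered points."""
    labels = np.full(len(X), -1)
    next_id = 0
    remaining = np.arange(len(X))                  # indices not yet clustered
    for eps in eps_schedule:                       # relax the density requirement each round
        if len(remaining) == 0:
            break
        sub = DBSCAN(eps=eps, min_samples=min_samples).fit(X[remaining])
        for cid in set(sub.labels_) - {-1}:        # keep the newly found clusters
            labels[remaining[sub.labels_ == cid]] = next_id
            next_id += 1
        remaining = remaining[sub.labels_ == -1]   # only still-unclustered points go on
    return labels
```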
what is the idea behind the density estimate
the intuition is that every data point contributes to the overall density value
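A tiny sketch of that intuition (the Gaussian kernel and the bandwidth h are my assumptions; the flashcard does not specify the kernel): the density at a query point is simply the sum of contributions from all data points.

```python
import numpy as np

def density(query, X, h=0.5):
    """Kernel density estimate: every data point in X contributes to the value at `query`."""
    sq_dists = np.sum((X - query) ** 2, axis=1)
    return np.exp(-sq_dists / (2 * h ** 2)).sum() / len(X)
```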
what is a fuzzy clustering
a cluster is a fuzzy set of objects, such that each element is characterized by a degree of membership to this cluster
explain briefly how fuzzy clustering works
In order to compute the memberships in fuzzy clustering, we obtain a partition matrix M, where each row gives one point's degrees of membership to each cluster, and the entries of each row sum to 1
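A small sketch of such a partition matrix, computed in the fuzzy c-means style (the fuzzifier m = 2, the given centers, and the function name are assumptions for illustration); note how each row sums to 1.

```python
import numpy as np

def membership_matrix(X, centers, m=2.0):
    """Rows: data points, columns: clusters; each row of the matrix sums to 1."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))                 # closer centers get larger weight
    return inv / inv.sum(axis=1, keepdims=True)   # normalize each row to sum to 1
```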
How to evaluate how well a fuzzy clustering describes a data set
use the sum of squared errors (SSE)
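Spelled out for fuzzy clustering (the standard membership-weighted SSE form, not quoted from the flashcards), with membership degrees m_ij from the partition matrix, centroids c_j, and a fuzzifier exponent p:

```latex
\mathrm{SSE} = \sum_{i=1}^{n} \sum_{j=1}^{k} m_{ij}^{\,p}\,\lVert x_i - c_j \rVert^2
```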
explain Probability-based clustering
Statistically, we can assume that a hidden cluster is a distribution over the data space, which can be mathematically represented using a probability density function.
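In standard mixture-model notation (added here for illustration), the overall density over the data space is then a weighted sum of the per-cluster density functions f_j with mixing weights π_j:

```latex
p(x) = \sum_{j=1}^{k} \pi_j\, f_j(x), \qquad \sum_{j=1}^{k} \pi_j = 1,\ \ \pi_j \ge 0
```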
Explain the fuzzy clustering in a more concrete way
The previously defined fuzzy clustering is not very concrete; for this reason, we usually use the Gaussian distribution. Hence we assume each cluster follows a Gaussian model with parameters (the mean and the standard deviation), treated as latent.
Here we have the data and we try to find the Gaussian distributions behind it
Explain briefly the EM algorithm
Generally, for every cluster, we start with a random assignment (random initial parameters)
then we optimize iteratively against the objective (e.g. the SSE): in each step the membership weights are recomputed and the cluster centroids (parameters) are updated, until convergence (a sketch follows below)
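A minimal EM sketch for a one-dimensional Gaussian mixture (a toy setting; the variable names are assumptions and this follows the textbook E-step/M-step pattern rather than the exact formulation in the lecture):

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iters=50, seed=0):
    """EM for a 1-D Gaussian mixture: alternate soft assignment and parameter updates."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)      # random initial means
    sigma = np.full(k, x.std())                    # initial standard deviations
    mix = np.full(k, 1.0 / k)                      # mixing weights
    for _ in range(n_iters):
        # E-step: soft membership weight of each point for each cluster
        dens = mix * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, standard deviations and mixing weights
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
        mix = nk / len(x)
    return mu, sigma, mix
```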
what is a good clustering
A good clustering method will produce high-quality clusters:
- high intra-class similarity: cohesive within clusters
- low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on:
- the similarity measure used by the method
- its implementation, and
- its ability to discover some or all of the hidden patterns
why use probabilistic model-based clustering
in some situations, an element may belong to several clusters, for instance a product review in which the client touches on several topics, so the review relates to more than one cluster
what is the main objective of representative-based clustering
the main objective is to minimize the sum of distances of the data points to their closest cluster representative (see the formula below)
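Written as a formula (standard notation, my phrasing rather than the lecture's), with representatives r_1, ..., r_k and data points x_i:

```latex
\min_{r_1,\dots,r_k}\; \sum_{i=1}^{n} \min_{j \in \{1,\dots,k\}} \operatorname{dist}(x_i, r_j)
```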
explain briefly the k-medoids clustering
this type of clustering does not require any mean or median computation; the cluster representatives are always chosen from the dataset itself (see the sketch below)
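A rough k-medoids sketch (a simple alternating variant rather than the full PAM algorithm; the names and the distance-matrix input are assumptions), showing that only pairwise distances are needed and the representatives are always actual data points:

```python
import numpy as np

def kmedoids(D, k, n_iters=100, seed=0):
    """D is an (n, n) pairwise distance matrix; medoids are indices of real data points."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iters):
        labels = D[:, medoids].argmin(axis=1)            # assign to the nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):                               # per cluster, pick the point with
            members = np.where(labels == j)[0]           # minimal total distance to the others
            if len(members):
                new_medoids[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):         # convergence
            break
        medoids = new_medoids
    return labels, medoids
```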