Clustering Flashcards

1
Q

What is clustering?

A

Clustering partitions the data points into groups such that the points within a group are similar to one another.

2
Q

What are the types of clustering?

A

Representative-based algorithms, e.g. K-means.

Hierarchical clustering algorithms, which build a hierarchy (a tree of nested clusters, for instance): we start from individual data points and merge those that are most similar, then at each level look at the resulting clusters and perform the same step again.

Probabilistic model-based algorithms: soft clustering.

3
Q

Explain the representative-based algorithm.

A

Usually the data and k are given, where k is the number of clusters.

  • The representatives are initially generated by probabilistic sampling.
  • Iteratively, until convergence:
    assign each point to the cluster of its nearest representative, using the distance function;
    perform an optimization step, where each representative is set to its cluster's centroid.
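
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, the choice to initialise by sampling k data points uniformly, and the Euclidean distance are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Assumed initialisation: sample k distinct data points as representatives.
    reps = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assign step: each point joins the cluster of its nearest representative.
        labels = [min(range(k), key=lambda j: math.dist(p, reps[j])) for p in points]
        # Optimise step: move each representative to the centroid of its cluster.
        new_reps = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_reps.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_reps.append(reps[j])  # keep the representative of an empty cluster
        if new_reps == reps:  # converged: the representatives stopped moving
            break
        reps = new_reps
    return reps, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
reps, labels = kmeans(points, k=2)
```

On this toy data the two well-separated groups end up in different clusters; on real data the result depends on the initial representatives.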
4
Q

What is the time complexity of K-means?

A

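The assign step: O(k·M) distance computations per iteration, where M is the number of data points (each point is compared with all k representatives).

The optimize step: O(M) per iteration, since each point contributes to exactly one centroid.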

5
Q

What are the disadvantages of K-means?

A

The result can be biased, especially when there are outliers.

K-means requires a distance function and a function to compute averages.

It works well for spherical clusters, but not for arbitrarily shaped clusters.

6
Q

What is the key difference between k-means and k-medoids?

A

The main difference is simply that k-medoids needs no function for computing the average.

7
Q

What are the key issues with k-medoids?

A

Finding k and the representatives.

8
Q

Why would we use grid-based density clustering?

A

Because some other clustering algorithms assume spherical clusters.

9
Q

What is the main idea behind density-based clustering?

A

1- detect dense areas

2- grow and merge them

10
Q

Explain briefly how grid algorithms work.

A

The idea is to discretize each dimension of the data into p intervals; the number of grid cells is then p^d, where d is the number of dimensions.
A density threshold τ is then used to select the dense hypercubes, i.e. the cells that contain at least τ points.
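
The discretize-and-threshold step can be sketched as follows. This is only an illustration under stated assumptions: the function name `dense_cells` is hypothetical, every dimension is assumed to span the same range `[lo, hi]`, and a cell is taken to be dense when it holds at least `tau` points. Growing and merging adjacent dense cells into clusters is left out.

```python
from collections import Counter

def dense_cells(points, p, tau, lo=0.0, hi=1.0):
    """Return the grid cells that hold at least tau points (illustrative sketch)."""
    width = (hi - lo) / p
    counts = Counter()
    for point in points:
        # Map each coordinate to one of p intervals; clamp x == hi into the last one.
        cell = tuple(min(int((x - lo) / width), p - 1) for x in point)
        counts[cell] += 1
    return {cell for cell, n in counts.items() if n >= tau}

points = [(0.05, 0.05), (0.10, 0.12), (0.15, 0.08),
          (0.90, 0.95), (0.92, 0.88), (0.50, 0.50)]
cells = dense_cells(points, p=4, tau=2)
```

With d = 2 and p = 4 there are 4^2 = 16 cells; the isolated point near the centre falls into a cell below the threshold and is dropped.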

11
Q

How to choose p and τ?

A

● When p is too small?
○ Points from multiple clusters will be present in the same grid region => undesirable merging of clusters.
● When p is too large?
○ Many empty grid cells => natural clusters in the data may be disconnected.
● When τ is too low?
○ Clusters, including the ambient noise, will be merged.
● When τ is too high?
○ We can partially or entirely miss a cluster.

12
Q

What are the main types of points that DBSCAN distinguishes?

A

Core points: points that have at least τ points in their neighbourhood.
Neighbour points (usually called border points): non-core points that fall in the neighbourhood of some core point.
Noise points: points that are neither core nor neighbour points.
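
The three categories can be computed directly from their definitions. The sketch below is not DBSCAN itself, only the point-labelling step, and it rests on assumptions the card does not state: the neighbourhood is a Euclidean ball of radius `eps`, a point counts as its own neighbour, and the label "border" is used as the standard name for neighbour points. The function name `classify_points` is hypothetical.

```python
import math

def classify_points(points, eps, tau):
    """Label each point 'core', 'border' or 'noise' (illustrative sketch)."""
    def neighbours(i):
        # Indices of all points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    # Core points have at least tau points in their eps-neighbourhood.
    core = {i for i in range(len(points)) if len(neighbours(i)) >= tau}
    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbours(i)):
            labels.append("border")  # not core, but close to a core point
        else:
            labels.append("noise")
    return labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
labels = classify_points(points, eps=1.0, tau=3)
```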

13
Q

What is progressive DBSCAN?

A

The idea is to cluster the data, relax τ, and then apply the algorithm again with the previously clustered data omitted, so that we capture the newly available clusters.

14
Q

What is the idea?

A

The intuition is that every data point contributes to the density value.
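
One common way to make this intuition concrete is kernel density estimation, where the density at a location is the sum of one smooth bump per data point. The card does not name a method, so the Gaussian kernel, the bandwidth `h`, and the function name `gaussian_density` below are assumptions chosen for illustration.

```python
import math

def gaussian_density(x, data, h=1.0):
    """Kernel density estimate at x: the average of one Gaussian bump per data point."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    # Every data point xi adds a contribution that decays with its distance to x.
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

data = [1.0, 1.2, 0.8, 5.0]
near_group = gaussian_density(1.0, data)   # three points lie close to x = 1.0
between = gaussian_density(3.0, data)      # no point lies close to x = 3.0
```

The estimate is higher near the group of three points than in the gap between the group and the isolated point.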

15
Q

What is fuzzy clustering?

A

A cluster is a fuzzy set of objects, such that each element is characterized by a degree of membership in this cluster.

16
Q

Explain briefly how fuzzy clustering works.

A

To compute the memberships in fuzzy clustering, we obtain a partition matrix M in which each row holds one object's degrees of membership in each cluster, and each row sums to 1.
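
A partition matrix with this property can be built from distances to the cluster centres. The card does not say how the degrees are computed, so the sketch below assumes the fuzzy c-means membership rule with fuzzifier `m`; the function name `membership_matrix` and the fixed centres are hypothetical.

```python
import math

def membership_matrix(points, centers, m=2.0):
    """One row per point, one column per cluster; every row sums to 1 (sketch)."""
    matrix = []
    for p in points:
        d = [math.dist(p, c) for c in centers]
        zeros = [dj == 0.0 for dj in d]
        if any(zeros):
            # A point lying exactly on a centre belongs fully to that cluster.
            row = [z / sum(zeros) for z in zeros]
        else:
            # Assumed fuzzy c-means rule: membership falls off as d^(-2/(m-1)).
            inv = [dj ** (-2.0 / (m - 1.0)) for dj in d]
            row = [v / sum(inv) for v in inv]
        matrix.append(row)
    return matrix

points = [(0.0,), (1.0,), (3.0,), (4.0,)]
centers = [(0.0,), (4.0,)]
M = membership_matrix(points, centers)
```

The point at 1.0 is three times closer to the first centre than to the second, so with m = 2 its row is roughly [0.9, 0.1].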

17
Q

How to evaluate how well a fuzzy clustering describes a data set?

A

Use the sum of squared errors (SSE).
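
For a fuzzy clustering, a common form of the SSE weights each squared distance by the membership degree raised to a power p. The sketch below assumes that form, Euclidean distance, and p = 2; the function name `fuzzy_sse` and the two example partition matrices are made up for illustration.

```python
import math

def fuzzy_sse(points, centers, memberships, p=2.0):
    """Sum over points and clusters of membership**p times squared distance."""
    return sum(
        (w ** p) * math.dist(o, c) ** 2
        for o, row in zip(points, memberships)
        for w, c in zip(row, centers)
    )

points = [(0.0,), (1.0,), (3.0,), (4.0,)]
centers = [(0.0,), (4.0,)]
good = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]  # points favour the near centre
bad = [[0.0, 1.0], [0.1, 0.9], [0.9, 0.1], [1.0, 0.0]]   # memberships swapped

sse_good = fuzzy_sse(points, centers, good)
sse_bad = fuzzy_sse(points, centers, bad)
```

A lower value means the memberships fit the data better; here the sensible assignment scores far lower than the swapped one.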

18
Q

Explain probability-based clustering.

A

Statistically, we can assume that a hidden cluster is a distribution over the
data space, which can be mathematically represented using a probability
density function

19
Q

Explain fuzzy clustering in a more concrete way.

A

The previously defined fuzzy clustering is not really concrete. For this reason we usually use the Gaussian distribution: we assume each cluster is a Gaussian model whose parameters are the mean and the standard deviation, and we treat it as a latent model.
We have the data, and we try to find the Gaussian distributions behind it.

20
Q

Explain briefly the EM algorithm

A

Generally, we start with a random assignment for every cluster.
Then we optimize by relying on the SSE; this is where the weights change and the cluster centroids are updated.
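
As an illustration, here is a sketch of EM for a one-dimensional mixture of Gaussians. Note the swap: this variant maximises the likelihood of the data rather than minimising the SSE mentioned above, and it starts from given initial parameters instead of a random assignment. The function names, the fixed iteration count `n_iter`, and the small floor on the standard deviation are assumptions made for the example.

```python
import math

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def em_gmm_1d(data, means, stds, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture (illustrative sketch)."""
    means, stds, weights = list(means), list(stds), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: soft-assign each point to the components (responsibilities).
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], stds[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and standard deviation per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            stds[j] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component
    return means, stds, weights

data = [0.9, 1.0, 1.1, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8]
means, stds, weights = em_gmm_1d(data, means=[0.0, 6.0], stds=[1.0, 1.0],
                                 weights=[0.5, 0.5])
```

On this toy data the two component means move from the initial guesses toward the centres of the two groups.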

21
Q

What is a good clustering?

A

A good clustering method will produce high-quality clusters:
  • high intra-class similarity: cohesive within clusters
  • low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on:
  • the similarity measure used by the method,
  • its implementation, and
  • its ability to discover some or all of the hidden patterns.

22
Q

Why use probabilistic model-based clustering?

A

In some situations an element may belong to several clusters, for instance in a product review where the client …

23
Q

What is the main objective of representative-based clustering?

A

The main objective is to minimize the distances of the points to their cluster centers.

24
Q

Explain briefly K-medoid clustering.

A

This type of clustering does not require any mean or median computation; the cluster representatives are always chosen from the dataset.
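
Because a medoid is picked from the data, it can be found with a distance function alone, even for objects that cannot be averaged. The sketch below shows only the representative-selection step for a single cluster, not a full k-medoids loop; the function names and the choice of Hamming distance on strings are assumptions for illustration.

```python
def medoid(cluster, dist):
    """Return the cluster member with the smallest total distance to the others."""
    return min(cluster, key=lambda c: sum(dist(c, other) for other in cluster))

def hamming(a, b):
    # Number of positions at which two equal-length strings differ.
    return sum(x != y for x, y in zip(a, b))

words = ["karma", "kafka", "karts", "marta"]
representative = medoid(words, hamming)
```

There is no meaningful "average" of these strings, yet a representative can still be chosen from among them.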