Clustering Flashcards
what is clustering
It is about partitioning the data points into groups where a group have some kind of similarity
what are the types of clustering
Representative-based algorithm K-mean
Hierarchical clustering algorithm where we build hierarchy (an automate for instance )we start from individual data points and find those that are most similar, at a certain level look at this clusters then perform the same thing
probabilistic model based algorithm : soft
Explain the representative based algorithm
Usually data and k are given, generally k represents the number of clusters,
- representative are initially generated using a probabilistic sampling
- itiratively :
generate clusters according to the points that are near to the representative using the distance function - perform an optimization step, where the representative is set according to the cluster centroid
until convergence
what is the time complexity of K-means
The assign part :O(k.M) where M is the size of each cluster
the optimize part O(M)
what is the disadventage of K-means
the algorithm might be biased, specially in cases where there are outliers
K-means require distance function and a function to compute averages
it actually work well for spherical cluster, however, for arbitrary shaped clusters it doesn’t work very well
what is the key difference k-means and medoids
the main difference consists simpbasedly there is no need for a function computing the average
what are the key issue with medoid
finding the K and the representatives
why we would we use the grid-based density
this is because some other clustering algorithms assume sphere-based clusters
what is the main idea behind the dense cluster
1- detect dense areas
2- grow and merge them
explain briefly how grid algorithms work
the idea is :
to discretize the data into P intervals, note that the number of grids corresponds to P^d where d is the dimensions
then density threshold τ is used to define the hypre cubes
how to choose p and τ
● When p is too small ?
○ points from multiple clusters will be present in the same
grid region => undesirable merging of clusters.
● When p is too large ?
○ Many empty grid cells => natural clusters in the data may
be disconnected
● When τ is too low ?
○ clusters including the ambient noise, will be merged
● When τ is too high ?
○ We can partially or entirely miss a cluster.
what are the main points that DBSCAN is able to distinguish
core points: are the ones that have τ points in their neighbourhood
neighbour points: are the ones that fall in the neighboorhood of some core points
noise points : are neither core or neighbour points
what is progressive DBSCAN
the idea consists of clustering, relax the taw then apply the algorithm again the previously clusters data should be omitted, so we capture the newly available
what is the idea
the intuition is that every data point contributes ti density value
what is a fuzzy clustering
a cluster is a fuzzy set of objects, such that each element is characterized by a degree of membership to this cluster,