Clustering Flashcards
what is clustering
It is about partitioning the data points into groups such that the points within a group share some kind of similarity
what are the types of clustering
Representative-based algorithms, e.g. k-means
Hierarchical clustering algorithms, where we build a hierarchy (a dendrogram, for instance): we start from individual data points and merge those that are most similar, then at the next level look at these clusters and perform the same merging again
probabilistic, model-based algorithms: soft clustering
Explain the representative-based algorithm
Usually the data and k are given, where k is the number of clusters.
- representatives are initially generated using probabilistic sampling
- iteratively:
generate the clusters by assigning each point to its nearest representative using the distance function - perform an optimization step, where each representative is set to its cluster centroid
until convergence (a sketch follows below)
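A minimal sketch of this representative-based loop (a plain k-means in NumPy; the function and parameter names are illustrative, not from the lecture):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # representatives are initially sampled from the data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assign: each point joins the cluster of its nearest representative
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # optimize: each representative is moved to its cluster centroid
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):   # convergence
            break
        centers = new_centers
    return labels, centers
```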
what is the time complexity of K-means
The assign step: O(k·n) per iteration, where n is the number of data points (each point is compared against the k representatives)
the optimize step: O(n) per iteration, since the centroids are computed in one pass over the points
what is the disadvantage of K-means
the algorithm might be biased, especially when there are outliers (they pull the centroids away)
K-means requires a distance function and a function to compute averages
it works well for spherical clusters; however, for arbitrarily shaped clusters it does not work very well
what is the key difference between k-means and k-medoids
the main difference is simply that there is no need for a function that computes averages: the representatives (medoids) are actual data points
what are the key issues with medoids
finding k and choosing good representatives (medoids)
why would we use grid-based density clustering
because other clustering algorithms assume sphere-shaped clusters, whereas density/grid-based methods can discover clusters of arbitrary shape
what is the main idea behind density-based clustering
1- detect dense areas
2- grow and merge them
explain briefly how grid algorithms work
the idea is:
to discretize each dimension of the data into p intervals; note that the number of grid cells is p^d, where d is the number of dimensions
then a density threshold τ is used to define the dense hypercubes (cells with at least τ points), which are grown and merged into clusters (see the sketch below)
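A rough 2-D sketch of this grid idea (the function name and the default values of p and τ are assumptions): discretize, keep the cells with at least τ points, then merge adjacent dense cells.

```python
import numpy as np
from collections import defaultdict

def grid_clusters(X, p=10, tau=5):
    """Grid-based density clustering sketch for 2-D data X of shape (n, 2)."""
    # discretize each dimension into p intervals -> p**d cells (d = 2 here)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    cells = np.minimum(((X - mins) / (maxs - mins + 1e-12) * p).astype(int), p - 1)
    counts = defaultdict(int)
    for c in map(tuple, cells):
        counts[c] += 1
    dense = {c for c, n in counts.items() if n >= tau}      # cells above the threshold τ
    # grow/merge: connect dense cells that touch each other (flood fill over the grid)
    cluster_of, next_id = {}, 0
    for cell in dense:
        if cell in cluster_of:
            continue
        stack = [cell]
        while stack:
            cur = stack.pop()
            if cur in cluster_of:
                continue
            cluster_of[cur] = next_id
            stack.extend((cur[0] + dx, cur[1] + dy)
                         for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                         if (cur[0] + dx, cur[1] + dy) in dense)
        next_id += 1
    # points falling in non-dense cells are treated as noise (label -1)
    return np.array([cluster_of.get(tuple(c), -1) for c in cells])
```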
how to choose p and τ
● When p is too small?
○ points from multiple clusters will be present in the same grid region => undesirable merging of clusters
● When p is too large?
○ many empty grid cells => natural clusters in the data may be disconnected
● When τ is too low?
○ clusters, together with the ambient noise, will be merged
● When τ is too high?
○ we can partially or entirely miss a cluster
what are the main types of points that DBSCAN is able to distinguish
core points: those that have at least τ points in their neighbourhood
border (neighbour) points: those that fall within the neighbourhood of some core point but are not core themselves
noise points: those that are neither core nor neighbour points
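As a quick illustration (the toy data and parameter values are made up), scikit-learn's DBSCAN exposes exactly these two ingredients as eps (the neighbourhood radius) and min_samples (the τ threshold):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)                        # toy data, purely illustrative
db = DBSCAN(eps=0.1, min_samples=5).fit(X)

labels = db.labels_                                # -1 marks noise points
core_mask = np.zeros_like(labels, dtype=bool)
core_mask[db.core_sample_indices_] = True          # core points
border_mask = (labels != -1) & ~core_mask          # clustered but not core
```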
what is progressive DBSCAN
the idea consists of clustering first, then relaxing τ (the density threshold) and applying the algorithm again; the previously clustered data is omitted, so that we capture the newly available, less dense clusters
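A possible sketch of this progressive loop (my own reading of the description; here "relaxing" means increasing the radius eps between rounds, and all names and values are assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def progressive_dbscan(X, eps_schedule=(0.05, 0.1, 0.2), min_samples=5):
    """Run DBSCAN repeatedly with a relaxed radius, omitting already clustered points."""
    labels = np.full(len(X), -1)
    next_id = 0
    remaining = np.arange(len(X))                  # indices not yet clustered
    for eps in eps_schedule:                       # relax the density requirement each round
        if len(remaining) == 0:
            break
        sub = DBSCAN(eps=eps, min_samples=min_samples).fit(X[remaining])
        for cid in set(sub.labels_) - {-1}:        # keep the newly found clusters
            labels[remaining[sub.labels_ == cid]] = next_id
            next_id += 1
        remaining = remaining[sub.labels_ == -1]   # only still-unclustered points go on
    return labels
```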
what is the idea behind the density estimate
the intuition is that every data point contributes to the overall density value
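A tiny sketch of that intuition (the Gaussian kernel and the bandwidth h are my assumptions; the flashcard does not specify the kernel): the density at a query point is simply the sum of contributions from all data points.

```python
import numpy as np

def density(query, X, h=0.5):
    """Kernel density estimate: every data point in X contributes to the value at `query`."""
    sq_dists = np.sum((X - query) ** 2, axis=1)
    return np.exp(-sq_dists / (2 * h ** 2)).sum() / len(X)
```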
what is a fuzzy clustering
a cluster is a fuzzy set of objects, such that each element is characterized by a degree of membership to this cluster
explain briefly how fuzzy clustering works
In order to compute the memberships in fuzzy clustering, we obtain a partition matrix M, where each row gives one point's degrees of membership to each cluster, and the entries of each row sum to 1
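A small sketch of such a partition matrix, computed in the fuzzy c-means style (the fuzzifier m = 2, the given centers, and the function name are assumptions for illustration); note how each row sums to 1.

```python
import numpy as np

def membership_matrix(X, centers, m=2.0):
    """Rows: data points, columns: clusters; each row of the matrix sums to 1."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))                 # closer centers get larger weight
    return inv / inv.sum(axis=1, keepdims=True)   # normalize each row to sum to 1
```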
How to evaluate how well a fuzzy clustering describes a data set
use the sum of squared errors (SSE)
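Spelled out for fuzzy clustering (the standard membership-weighted SSE form, not quoted from the flashcards), with membership degrees m_ij from the partition matrix, centroids c_j, and a fuzzifier exponent p:

```latex
\mathrm{SSE} = \sum_{i=1}^{n} \sum_{j=1}^{k} m_{ij}^{\,p}\,\lVert x_i - c_j \rVert^2
```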
explain Probability-based clustering
Statistically, we can assume that a hidden cluster is a distribution over the data space, which can be mathematically represented using a probability density function.
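In standard mixture-model notation (added here for illustration), the overall density over the data space is then a weighted sum of the per-cluster density functions f_j with mixing weights π_j:

```latex
p(x) = \sum_{j=1}^{k} \pi_j\, f_j(x), \qquad \sum_{j=1}^{k} \pi_j = 1,\ \ \pi_j \ge 0
```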
Explain the fuzzy clustering in a more concrete way
The previously defined fuzzy clustering is not very concrete; for this reason, we usually use the Gaussian distribution. Hence we assume each cluster follows a Gaussian model with parameters (the mean and the standard deviation), treated as latent.
Here we have the data and we try to find the Gaussian distributions behind it
Explain briefly the EM algorithm
Generally, for every cluster, we start with a random assignment (random initial parameters)
then we optimize iteratively against the objective (e.g. the SSE): in each step the membership weights are recomputed and the cluster centroids (parameters) are updated, until convergence (a sketch follows below)
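A minimal EM sketch for a one-dimensional Gaussian mixture (a toy setting; the variable names are assumptions and this follows the textbook E-step/M-step pattern rather than the exact formulation in the lecture):

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iters=50, seed=0):
    """EM for a 1-D Gaussian mixture: alternate soft assignment and parameter updates."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)      # random initial means
    sigma = np.full(k, x.std())                    # initial standard deviations
    mix = np.full(k, 1.0 / k)                      # mixing weights
    for _ in range(n_iters):
        # E-step: soft membership weight of each point for each cluster
        dens = mix * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, standard deviations and mixing weights
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
        mix = nk / len(x)
    return mu, sigma, mix
```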
what is a good clustering
A good clustering method will produce high-quality clusters:
- high intra-class similarity: cohesive within clusters
- low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on:
- the similarity measure used by the method
- its implementation, and
- its ability to discover some or all of the hidden patterns
why use probabilistic model-based clustering
in some situations, an element may belong to several clusters, for instance a product review in which the client touches on several topics, so the review relates to more than one cluster
what is the main objective of representative-based clustering
the main objective is to minimize the sum of distances of the data points to their closest cluster representative (see the formula below)
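Written as a formula (standard notation, my phrasing rather than the lecture's), with representatives r_1, ..., r_k and data points x_i:

```latex
\min_{r_1,\dots,r_k}\; \sum_{i=1}^{n} \min_{j \in \{1,\dots,k\}} \operatorname{dist}(x_i, r_j)
```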
explain briefly the k-medoids clustering
this type of clustering does not require any mean or median computation; the cluster representatives are always chosen from the dataset itself (see the sketch below)
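A rough k-medoids sketch (a simple alternating variant rather than the full PAM algorithm; the names and the distance-matrix input are assumptions), showing that only pairwise distances are needed and the representatives are always actual data points:

```python
import numpy as np

def kmedoids(D, k, n_iters=100, seed=0):
    """D is an (n, n) pairwise distance matrix; medoids are indices of real data points."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iters):
        labels = D[:, medoids].argmin(axis=1)            # assign to the nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):                               # per cluster, pick the point with
            members = np.where(labels == j)[0]           # minimal total distance to the others
            if len(members):
                new_medoids[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):         # convergence
            break
        medoids = new_medoids
    return labels, medoids
```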