Clustering Flashcards
Clustering
A set of techniques for finding subgroups, or clusters, in a data set. A good clustering is one in which the observations within a group are similar to each other, while observations in different groups are very different
High intra-class similarity:
cohesive within clusters
Low inter-class similarity:
distinctive between clusters
Question: What factors determine the quality of a clustering method?
Answer:
The similarity measure used by the method
Its implementation
Its ability to discover some or all of the hidden patterns
Centroid-based Clustering
Represents each cluster through its center, called the centroid
K-Means
A centroid-based algorithm. The “means” are the centroids, represented by the arithmetic means (averages) of the values along each dimension for the instances in the cluster
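A minimal sketch of the K-Means idea (Lloyd's algorithm) in plain NumPy; the toy data points and the simplistic initialization are made up for illustration:

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = X[:k].astype(float).copy()  # simplistic init: first k points
    for _ in range(n_iter):
        # squared distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its cluster, per dimension
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments stable: converged
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated groups, interleaved so the first two
# points (the initial centroids) come from different groups
X = np.array([[1.0, 1.0], [8.0, 8.0], [1.2, 0.8],
              [8.2, 7.9], [0.9, 1.1], [7.8, 8.1]])
centroids, labels = kmeans(X, 2)
```

Real implementations (e.g. scikit-learn's `KMeans`) add smarter initialization and restarts, since the result depends on where the centroids start.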
What is a clustering’s Distortion?
The sum of squared differences between each data point and its corresponding centroid. The clustering with the lowest distortion value can be deemed the best clustering
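The distortion of a given assignment can be computed directly; the points, labels, and centroids below are hypothetical values for illustration:

```python
import numpy as np

# Hypothetical 2-D points, their cluster assignments, and the two centroids
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.1, 0.9], [8.1, 7.95]])

# Distortion: sum of squared differences between each point and its centroid
distortion = ((X - centroids[labels]) ** 2).sum()
```

Running K-Means for several values of k and comparing distortions is a common way to pick the number of clusters.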
K-Means Weaknesses
- Sensitive to noisy data and outliers
- Weak in clustering non-convex shapes
K-Means Strengths
- Fast and efficient
- Often terminates at a local optimum (the global optimum is not guaranteed)
- Applicable to continuous n-dimensional space
How are observations clustered in K-Means?
One way is to minimize the sum of all pairwise squared Euclidean distances between the observations in each cluster.
The general equation for Euclidean distance in n dimensions is:
d(A, B) = sqrt( (d_{1,A} - d_{1,B})^2 + (d_{2,A} - d_{2,B})^2 + ... + (d_{n,A} - d_{n,B})^2 )
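The formula above translates directly into code; a small sketch using only the standard library:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two n-dimensional points a and b:
    the square root of the sum of squared per-dimension differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# In 2-D this reduces to the familiar Pythagorean form
d = euclidean((0, 0), (3, 4))  # → 5.0
```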
What is a Dendrogram?
A tree-based representation of the observations, produced by hierarchical clustering; producing one is an added advantage of that method
Hierarchical Clustering
Produces a tree-based representation of the observations, called a dendrogram, and doesn’t require choosing the number of clusters in advance.
Hierarchical Clustering Weaknesses
- Not very scalable
- Distance matrix can be huge to calculate
- Cannot undo what was done previously
Hierarchical Clustering Strengths
- Availability of a dendrogram
- Smaller clusters may be generated
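A sketch of hierarchical (agglomerative) clustering using SciPy, assuming SciPy is installed; the toy points are made up. Note that the number of clusters is chosen only after the tree is built, by cutting the dendrogram:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups of 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Build the merge tree bottom-up; Z encodes the dendrogram
Z = linkage(X, method="complete")

# Cut the tree into 2 clusters; the cut point can be changed without
# re-running the clustering, unlike K-Means where k is fixed up front
labels = fcluster(Z, t=2, criterion="maxclust")
```

`scipy.cluster.hierarchy.dendrogram(Z)` would plot the tree itself.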