Unit 4 - Fundamentals of Clustering Flashcards
What is clustering?
Discover natural groupings in data
Clustering does not attempt to reveal relationships underlying the variables (as in the case of association analysis). Instead, clustering aims to group objects or records into different clusters to identify groups that exhibit a high degree of internal (within-cluster) similarity and external (between-cluster) dissimilarity.
What are some clustering applications?
- Market Segmentation
Divide customers into groups so that a mixture of selling strategies can be formulated to maximize company revenue
- Credit Scoring
A bank or credit card company may apply clustering to identify potential fraud cases
- Anomaly Detection
Single out objects that do not fall into the normal clusters; e.g. an underwriter attempts to single out the accident-prone clients who are likely to file claims after being insured
- Image Segmentation
An image recognition system tries to recognize different parts of an image
What are some examples of clustering?
- A capitalist endeavours to spot some potential hotels for takeover, based on their financial health, market shares, location and quality of service.
- An underwriter attempts to single out the accident-prone clients who are likely to file claims after being insured.
- A football club targets to shortlist a pool of potential players, according to their experience, age, transfer fees, market value, goal-scoring rate and fitness.
- A higher learning committee plans to categorise colleges in terms of student-faculty ratio, graduation rate, research ranking and alumni donation.
- An international agency sets out to segment countries by means of indicators like GDP per capita, literacy rate, population density, life expectancy and standards of living.
What is a cluster?
A set whose members are similar to one another but dissimilar to objects outside the cluster.
Why is clustering different from association analysis?
Clustering finds relationships between records (objects), whereas association analysis finds relationships between fields (variables)
Why is clustering different from classification?
Clustering does not train with a class label, unlike building classifiers in supervised learning
Why is clustering different from factor analysis?
Factor analysis groups variables, whereas clustering groups objects
Steps in cluster analysis?
1) define a measure of similarity
2) choose an appropriate clustering algorithm
3) set the number of clusters
4) generate the clusters
5) Interpret the clusters, profile the clusters with the aim of illustrating their differences
6) validate the clusters
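The steps above can be sketched end-to-end with scikit-learn. This is a minimal illustration, not a prescribed pipeline; the customer data and variable names are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# toy customer records: [annual spend, visits per month] (hypothetical data)
X = np.array([[200, 1], [250, 2], [220, 1],
              [900, 8], [950, 9], [880, 7]], dtype=float)

# 1) similarity measure: Euclidean distance on standardized variables
Xs = StandardScaler().fit_transform(X)

# 2)-3) choose an algorithm (K-means) and set the number of clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0)

# 4) generate the clusters
labels = km.fit_predict(Xs)

# 5) profile the clusters with their mean spend/visits (original units)
for j in range(2):
    print(j, X[labels == j].mean(axis=0))
```

Step 6 (validation) is typically done with criteria such as RMSSTD or the average silhouette, discussed later in these notes.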
What are the two types of clustering?
1) Hierarchical — agglomerative ("bottom-up") or divisive ("top-down")
2) Partitional — e.g. K-means
Steps of hierarchical clustering?
1) Compute the proximity among the data objects
2) Repeatedly merge the closest data objects/clusters
3) Update the proximity between the new cluster and the original clusters until all objects are grouped to form one final cluster
Steps of partitioning clustering?
1) select k seed points as initial centroids
2) form k clusters by assigning each object to its closest centroid
3) update the centroid of each cluster until the stopping criterion is fulfilled
AKA K-means
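The three partitioning steps can be written out as a minimal K-means loop. This is a numpy sketch for illustration (the function name and data are my own), not a production implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: seed centroids, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # 1) select k seed points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2) form k clusters by assigning each object to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) update the centroid of each cluster
        # (keep the old centroid if a cluster happens to become empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # stopping criterion: centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# two well-separated blobs of 5 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(10, 0.5, (5, 2))])
labels, centroids = kmeans(X, k=2)
```

In practice the stopping criterion may also be a fixed iteration budget or a tolerance on centroid movement, as in scikit-learn's `KMeans`.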
What is a dendrogram?
A tree diagram that illustrates at which distance levels objects and clusters are merged
It shows how the clusters are formed step by step
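A dendrogram can be built with scipy. Here `no_plot=True` is used so the example returns the tree structure rather than drawing it; the toy data is my own:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# small 1-D dataset; linkage expects a 2-D observation matrix
X = np.array([[1.0], [2.0], [9.0], [10.0]])

# agglomerative merges (Ward's method here, as one example)
Z = linkage(X, method="ward")

# build the dendrogram structure without plotting
d = dendrogram(Z, no_plot=True)
print(d["ivl"])  # leaf order: similar objects end up adjacent
```

Passing `Z` to `dendrogram` without `no_plot=True` (with matplotlib installed) draws the familiar tree, with merge heights on the vertical axis.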
How reliant is clustering on the analyst? 7 points
- Exploratory in nature
- In hierarchical clustering, no grouping structure is assumed
- Clustering solution relies heavily on the analyst’s knowledge to label the clusters in a meaningful way
- Different solutions may be obtained with different stopping rules
- Usually, more than one competing solution is evaluated before the final clustering solution is determined
- The interpretation and labelling of a cluster is a subjective or even creative affair
- In some instances, there may be more than one interpretation for the identified clusters
What are proximity measures?
Measures of closeness between objects: distance (dissimilarity) or similarity measures
What are the commonly used proximity measures?
- Euclidean distance
- Mahalanobis distance
- Minkowski distance.
Difference between Mahalanobis distance and Euclidean distance?
Two points A and B may be equally distant from a point X in Euclidean terms, but not once the density (dispersion) of the distribution is taken into account
- Mahalanobis distance takes dispersion (covariance) into account
- It is also useful for anomaly detection, especially for non-spherical distributions such as ellipsoidal ones
Difference between Euclidean and Minkowski?
Euclidean distance is a special case of Minkowski distance with M=2
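The three proximity measures are available in `scipy.spatial.distance`. A small sketch with made-up points, showing that Euclidean distance equals Minkowski distance with p (the M above) set to 2, and that Mahalanobis distance needs the inverse covariance of the data:

```python
import numpy as np
from scipy.spatial.distance import euclidean, minkowski, mahalanobis

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Euclidean distance is Minkowski with p = 2
print(euclidean(a, b))        # 5.0
print(minkowski(a, b, p=2))   # 5.0 — same value
print(minkowski(a, b, p=1))   # 7.0 — p = 1 gives Manhattan distance

# Mahalanobis needs the inverse covariance matrix of the data's dispersion;
# here the second variable is deliberately more spread out
data = np.random.default_rng(0).normal(size=(100, 2)) * [1.0, 5.0]
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(mahalanobis(a, b, VI))  # smaller along the high-variance direction
```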
What does clustering criteria Root-mean-square standard deviation do?
pools the variations of all the clustering variables that form the cluster at a certain stage of the clustering process
The formula of Root-Mean-Square standard deviation?
The square root of the pooled sum of squares for all clustering variables divided by the pooled degrees of freedom for all the clustering variables
Small RMSSTD → small variations → homogeneous cluster
Large RMSSTD → large variations → heterogeneous cluster
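The formula can be computed directly. A sketch, assuming a cluster stored as an n-by-p array (n objects, p clustering variables), so the pooled degrees of freedom are p(n − 1); the function name and data are my own:

```python
import numpy as np

def rmsstd(cluster):
    """RMSSTD of one cluster: sqrt(pooled SS / pooled degrees of freedom)."""
    cluster = np.asarray(cluster, dtype=float)
    n, p = cluster.shape
    # pooled sum of squares over all clustering variables
    ss = ((cluster - cluster.mean(axis=0)) ** 2).sum()
    # pooled degrees of freedom: (n - 1) for each of the p variables
    return np.sqrt(ss / (p * (n - 1)))

tight = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]]   # homogeneous cluster
loose = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]  # heterogeneous cluster
print(rmsstd(tight))  # small
print(rmsstd(loose))  # large
```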
Is 3-cluster solution or 4-cluster solution preferred?
3-cluster as it has a smaller RMSSTD
What are other rules to determine the number of clusters?
- Akaike’s information criterion (AIC)
- Bayesian information criterion (BIC)
- Average silhouette
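The average-silhouette rule can be illustrated with scikit-learn: fit competing cluster counts and keep the one with the highest average silhouette. A sketch on made-up data with three well-separated groups:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# three well-separated blobs (hypothetical data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [5, 5], [10, 0])])

# average silhouette for competing cluster counts; higher is better
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

On this data the 3-cluster solution attains the highest average silhouette, matching the true number of blobs.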
What is Ward’s linkage metric?
At each step, merges the pair of clusters whose combination minimises the increase in the total within-cluster sum of squares.
Illustration of agglomerative methods of clustering?
1) Assign each object to its own cluster (N objects, N clusters).
2) Calculate the distances (similarities) between all clusters. This can be done using different methods of linkage (e.g. single linkage, complete linkage, average linkage, centroid method, or Ward’s method).
3) Find the closest (most similar) pair of clusters and combine them into one single cluster, so we get one cluster fewer.
4) Calculate the distances (similarities) between the new cluster and the others.
5) Repeat steps 2, 3 and 4 until all objects are grouped together in one single cluster of size N (N−1 iterations).
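The whole agglomerative procedure, with any of the linkage methods named above, is a one-liner in scipy; `fcluster` then cuts the tree at a chosen number of clusters. The toy data is my own:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# four 1-D objects forming two obvious pairs
X = np.array([[0.0], [1.0], [9.0], [10.0]])

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)  # N - 1 = 3 merge steps
    # cut the dendrogram to obtain exactly 2 clusters
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, labels)
```

Each row of `Z` records one merge: the two clusters combined, the distance at which they merged, and the new cluster's size.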
What is hierarchical clustering using single linkage?
The distance between two clusters is the minimum distance over all possible pairs of objects, one from each cluster
e.g. min(10, 20) = 10
What’s the final outcome of a single linkage?
One big cluster; single linkage is prone to chaining nearby objects together step by step
What is hierarchical clustering using complete linkage?
The distance between two clusters is the maximum distance over all possible pairs of objects, one from each cluster
e.g. max(10, 20) = 20
What is hierarchical clustering using centroid linkage?
d(Cluster m, Cluster k) = the distance between the centroids of the two clusters
What is hierarchical clustering using average linkage?
The distance between two clusters is the average distance over all possible pairs of objects, one from each cluster; e.g. average(10, 20) = 15
AKA group-average hierarchical clustering
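The four linkage definitions can be checked side by side on the flashcards' numbers. A sketch with hypothetical 1-D clusters chosen so the two pairwise distances are exactly 10 and 20:

```python
import numpy as np
from scipy.spatial.distance import cdist

cluster_m = np.array([[0.0]])           # one object at 0
cluster_k = np.array([[10.0], [20.0]])  # two objects at 10 and 20

D = cdist(cluster_m, cluster_k)  # all pairwise distances between the clusters
print(D.min())   # single linkage:   min(10, 20) = 10.0
print(D.max())   # complete linkage: max(10, 20) = 20.0
print(D.mean())  # average linkage:  average(10, 20) = 15.0

# centroid linkage: distance between the two cluster centroids
print(np.linalg.norm(cluster_m.mean(axis=0) - cluster_k.mean(axis=0)))  # 15.0
```

Centroid and average linkage coincide here only by construction; on general data they differ.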
Effects of single linkage?
Good for detecting arbitrary-shaped clusters
- Cannot detect overlapping clusters
Effects of complete-linkage?
good for detecting overlapping clusters
not for detecting arbitrary-shaped clusters
effects of centroid and average linkage?
somewhere in between single linkage and complete linkage
Practical issues of hierarchical clustering?
- Works more efficiently with a pre-computed N-by-N proximity matrix
- At least O(N^2) time and space complexity
- With 1,000 records:
  - about 1,000,000 time units to compute
  - about 1,000,000 memory units to store the proximity values