Unit 4 - Fundamentals of Clustering Flashcards
What is clustering?
Discover natural groupings in data
Clustering does not attempt to reveal relationships underlying the variables (as in the case of association analysis). Instead, clustering aims to group objects or records into different clusters to identify groups that exhibit a high degree of internal (within-cluster) similarity and external (between-cluster) dissimilarity.
What are some clustering applications?
- Market Segmentation
Divide customers into groups so that a mixture of selling strategies can be formulated to maximize company revenue
- Credit Scoring
A bank or credit card company may apply clustering to identify potential fraud cases
- Anomaly detection (things that are not in the normal clusters)
An underwriter attempts to single out the accident-prone clients who are likely to file claims after being insured
- Image Segmentation
An image recognition system tries to recognize different parts of an image
What are some examples of clustering?
- A capitalist endeavours to spot some potential hotels for takeover, based on their financial health, market shares, location and quality of service.
- An underwriter attempts to single out the accident-prone clients who are likely to file claims after being insured.
- A football club aims to shortlist a pool of potential players, according to their experience, age, transfer fees, market value, goal-scoring rate and fitness.
- A higher learning committee plans to categorise colleges in terms of student-faculty ratio, graduation rate, research ranking and alumni donation.
- An international agency sets out to segment countries by means of indicators like GDP per capita, literacy rate, population density, life expectancy and standards of living.
What is a cluster?
A set whose members are similar to one another but dissimilar to objects outside the cluster.
Why is clustering different from association analysis?
Clustering finds relationships between records; association analysis finds relationships between fields (variables).
Why is clustering different from classification?
Clustering is unsupervised: it does not train with a class label, whereas classifiers are built with class labels in supervised learning.
Why is clustering different from factor analysis?
Factor analysis groups variables, whereas clustering groups objects (records).
Steps in cluster analysis?
1) define a measure of similarity
2) choose an appropriate clustering algorithm
3) set the number of clusters
4) generate the clusters
5) interpret and profile the clusters, with the aim of illustrating their differences
6) validate the clusters
What are the two types of clustering?
1) Hierarchical (agglomerative “bottom up” or divisive “top down”)
2) Partitional (e.g. K-means)
Steps of hierarchical clustering?
1) compute the proximity among the data objects
2) merge the two closest data objects (or clusters)
3) update the proximity between the new cluster and the remaining clusters
Repeat steps 2 and 3 until all objects are grouped to form one final cluster
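The agglomerative steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance as the proximity measure and single linkage (closest pair of members) as the cluster-to-cluster proximity, neither of which is specified in the card. The names `agglomerative` and `single_linkage` are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    # Straight-line distance between two points of equal dimension.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points, c1, c2):
    # Assumed linkage rule: distance between the closest pair of members.
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points):
    # Start with every object in its own cluster (tuples of point indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters. Proximities are recomputed
        # each round, which plays the role of the "update" step.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_linkage(points, *pair))
        dist = single_linkage(points, a, b)
        # Replace the two clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, dist))
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
for a, b, dist in agglomerative(points):
    print(a, b, round(dist, 3))
```

Each entry in the merge history records which two clusters were joined and at what distance, which is the information a dendrogram displays.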
Steps of partitioning clustering?
1) select k seed points as initial centroids
2) form k clusters by assigning each object to its closest centroid
3) update the centroid of each cluster
Repeat steps 2 and 3 until the stopping criterion is fulfilled
The best-known example is K-means
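The partitioning steps above can be sketched as a small K-means loop. This is a minimal sketch under stated assumptions: Euclidean distance, seed points supplied by the caller, and "centroids stop changing" (or an iteration cap) as the stopping criterion. The function name `k_means` and the `max_iter` parameter are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    # Coordinate-wise average of a non-empty list of points.
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, seeds, max_iter=100):
    centroids = list(seeds)                      # step 1: initial centroids
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each object to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: update each centroid (keep the old one if a cluster is empty).
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Assumed stopping criterion: centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, seeds=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(clusters)
```

The result depends on the chosen seeds and on k, so in practice several runs with different seeds are usually compared.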
What is a dendrogram?
A tree diagram that illustrates how the objects are grouped and at what level of distance each merge takes place
It can show how the clusters are formed
How reliant is clustering on an analyst? 7 points
- Exploratory in nature
- In hierarchical clustering, no grouping structure is assumed
- Clustering solution relies heavily on the analyst’s knowledge to label the clusters in a meaningful way
- Different solutions may be obtained with different stopping rules
- Usually, more than one competing solution is evaluated before the clustering solution is determined
- The interpretation and labelling of a cluster is a subjective or even creative affair
- In some instances, there may be more than one interpretation for the identified clusters
What are proximity measures?
Measures of closeness between objects, expressed as a distance (dissimilarity) or a similarity
What are the commonly used proximity measures?
- Euclidean distance
- Mahalanobis distance
- Minkowski distance
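The three measures listed above can be written out in plain Python to show how they relate. This is a sketch under assumptions: points are tuples of numbers, the Minkowski order `p` is supplied by the caller, and the Mahalanobis function takes an already-inverted covariance matrix (`inv_cov`, a hypothetical parameter name) rather than estimating it from data.

```python
import math

def minkowski(a, b, p):
    # Minkowski distance of order p: (sum |x_i - y_i|^p)^(1/p).
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def euclidean(a, b):
    # Euclidean distance is the Minkowski distance with p = 2.
    return minkowski(a, b, 2)

def mahalanobis(a, b, inv_cov):
    # sqrt(d^T * S^-1 * d), where d = a - b and inv_cov is the inverse
    # covariance matrix S^-1 given as a list of rows.
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    q = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(q)

a, b = (0.0, 0.0), (3.0, 4.0)
identity = [[1.0, 0.0], [0.0, 1.0]]

print(euclidean(a, b))              # Minkowski with p = 2
print(minkowski(a, b, 1))           # p = 1 is the city-block (Manhattan) distance
print(mahalanobis(a, b, identity))  # with an identity matrix this matches Euclidean
```

Mahalanobis distance rescales each direction by the variability of the data, so variables measured on large scales do not dominate the result the way they can with Euclidean distance.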