Unit 4 - Fundamentals of Clustering Flashcards

1
Q

What is clustering?

A

Discover natural groupings in data

Clustering does not attempt to reveal relationships among the variables (as association analysis does). Instead, clustering aims to group objects or records into clusters that exhibit a high degree of internal (within-cluster) similarity and external (between-cluster) dissimilarity.

2
Q

What are some clustering applications?

A
  • Market Segmentation
    Divide customers into groups so that a mixture of selling strategies can be formulated to maximize company revenue
  • Credit Scoring
    A bank or credit card company may apply clustering to identify potential fraud cases
  • Anomaly Detection – spotting things that are not in the normal clusters
    An underwriter attempts to single out the accident-prone clients who are likely to file claims after being insured
  • Image Segmentation
    An image recognition system tries to recognize different parts of an image
3
Q

What are some examples of clustering?

A
  • A capitalist endeavours to spot potential hotels for takeover, based on their financial health, market shares, location and quality of service.
  • An underwriter attempts to single out the accident-prone clients who are likely to file claims after being insured.
  • A football club aims to shortlist a pool of potential players, according to their experience, age, transfer fees, market value, goal-scoring rate and fitness.
  • A higher learning committee plans to categorise colleges in terms of student-faculty ratio, graduation rate, research ranking and alumni donation.
  • An international agency sets out to segment countries by means of indicators like GDP per capita, literacy rate, population density, life expectancy and standards of living.
4
Q

What is a cluster?

A

A set whose members are similar to one another but dissimilar to objects outside the cluster.

5
Q

Why is clustering different from association analysis?

A

Clustering finds relationships between records (objects), whereas association analysis finds relationships between fields (variables)

6
Q

Why is clustering different from classification?

A
Clustering does not train with a class label, unlike building classifiers in supervised learning
7
Q

Why is clustering different from factor analysis?

A

Factor analysis groups variables, whereas clustering groups objects

8
Q

Steps in cluster analysis?

A

1) Define a measure of similarity
2) Choose an appropriate clustering algorithm
3) Set the number of clusters
4) Generate the clusters
5) Interpret and profile the clusters with the aim of illustrating their differences
6) Validate the clusters
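
A minimal end-to-end sketch of these steps in Python, assuming scikit-learn; the synthetic data and the choice of k = 3 are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Illustrative data: 200 records, 4 clustering variables
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# 1) measure of similarity: k-means uses Euclidean distance,
#    so standardise the variables first
X_std = StandardScaler().fit_transform(X)

# 2)-4) choose an algorithm, set the number of clusters, generate them
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)

# 5) profile the clusters, e.g. via their centroids
print(model.cluster_centers_)

# 6) validate, e.g. with the average silhouette
print(silhouette_score(X_std, model.labels_))
```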

9
Q

What are the two types of clustering?

A

1) Hierarchical (agglomerative, "bottom-up": repeatedly merge the closest clusters)

2) Partitional ("top-down": divide the data directly into k clusters, e.g. k-means)

10
Q

Steps of hierarchical clustering?

A

1) Compute the proximity among the data objects
2) Repeatedly merge the closest data objects/clusters
3) Update the proximity between the new and the original clusters; repeat steps 2 and 3 until all objects are grouped to form one final cluster
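
A short sketch of these steps with SciPy; the sample data and the single-linkage choice are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: 10 objects, 2 variables
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))

# Compute proximities and repeatedly merge the closest clusters;
# each row of Z records one merge (clusters joined, merge distance)
Z = linkage(X, method="single")

# Cut the hierarchy to recover, say, 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```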

11
Q

Steps of partitioning clustering?

A

1) Select k seed points as initial centroids
2) Form k clusters by assigning each object to its closest centroid
3) Update the centroid of each cluster; repeat steps 2 and 3 until the stopping criterion is fulfilled

Also known as k-means
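
A bare-bones sketch of these steps in Python; Euclidean distance and a "centroids stop moving" stopping rule are assumptions, and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) select k seed points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2) assign each object to its closest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) update each centroid to the mean of its assigned objects
        updated = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # stopping criterion: centroids no longer move
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# Usage on illustrative data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
labels, centroids = kmeans(X, k=3)
```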

12
Q

What is a dendrogram?

A

A diagram that illustrates how the objects are grouped and at what level of distance each grouping occurs
It shows, step by step, how the clusters are formed
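
A dendrogram can be drawn from a SciPy linkage matrix; matplotlib and the toy data below are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
Z = linkage(X, method="single")

# Each U-shaped link joins two clusters at the height (distance)
# where they were merged
dendrogram(Z)
plt.xlabel("object")
plt.ylabel("merge distance")
plt.show()
```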

13
Q

How reliant is clustering on the analyst? (7 points)

A
  • Exploratory in nature
  • In hierarchical clustering, no grouping structure is assumed
  • Clustering solution relies heavily on the analyst’s knowledge to label the clusters in a meaningful way
  • Different solutions may be obtained with different stopping rules
  • Usually, more than one competing solution is evaluated before the clustering solution is determined
  • The interpretation and labelling of a cluster is a subjective or even creative affair
  • In some instances, there may be more than one interpretation for the identified clusters
14
Q

What are proximity measures?

A

Measures of the closeness between objects, expressed either as a distance (dissimilarity) or as a similarity

15
Q

What are the commonly used proximity measures?

A
  • Euclidean distance
  • Mahalanobis distance
  • Minkowski distance
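
All three can be computed with SciPy; the two sample points and the covariance matrix below are made up for illustration:

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis, minkowski

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

print(euclidean(x, y))       # 5.0
print(minkowski(x, y, p=2))  # 5.0 -- Euclidean is Minkowski with M = 2
print(minkowski(x, y, p=1))  # 7.0 -- Manhattan distance

# Mahalanobis needs the inverse covariance matrix of the data
# (this S is an illustrative assumption)
S = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(mahalanobis(x, y, np.linalg.inv(S)))
```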
16
Q

Difference between Mahalanobis distance and Euclidean distance?

A

Two points A and B may be equally distant from X under Euclidean distance, but this can be misleading once the density (dispersion) of the data is taken into account

  • Mahalanobis distance takes dispersion into account
  • It is also useful for anomaly detection, especially for non-spherical distributions such as an ellipsoidal shape
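
In symbols, with S the sample covariance matrix (the standard definition, not quoted from the cards):

$$ d_M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^{\top} S^{-1} (\mathbf{x} - \mathbf{y})} $$

When S is the identity matrix, this reduces to the Euclidean distance.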
17
Q

Difference between Euclidean and Minkowski?

A

Euclidean distance is a special case of Minkowski distance with M=2
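
The general Minkowski form, for points with p coordinates (standard definition, assumed here):

$$ d(\mathbf{x}, \mathbf{y}) = \Big( \sum_{i=1}^{p} |x_i - y_i|^{M} \Big)^{1/M} $$

M = 1 gives the Manhattan distance; M = 2 gives the Euclidean distance.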

18
Q

What does clustering criteria Root-mean-square standard deviation do?

A

It pools the variation of all the clustering variables that form a cluster at a given stage of the clustering process

19
Q

The formula of Root-Mean-Square standard deviation?

A

The square root of the pooled sum of squares for all clustering variables divided by the pooled degrees of freedom for all the clustering variables

small RMSSTD → small variations → homogeneous cluster
large RMSSTD → large variations → heterogeneous cluster
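
In symbols, pooling over the p clustering variables, with n_j observations on variable j (a standard form, assumed to match the course's):

$$ \text{RMSSTD} = \sqrt{\frac{\sum_{j=1}^{p} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2}{\sum_{j=1}^{p} (n_j - 1)}} $$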

20
Q

Is 3-cluster solution or 4-cluster solution preferred?

A

The 3-cluster solution, as it has the smaller RMSSTD (its clusters are more homogeneous)

21
Q

What are other rules to determine the number of clusters?

A

Akaike’s information criterion (AIC)
Bayesian information criterion (BIC)
Average silhouette
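
A sketch of comparing candidate values of k by average silhouette with scikit-learn; the data and the range of k are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3))

# Fit k-means for several candidate k; prefer the k with the
# largest average silhouette
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```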

22
Q

What is Ward’s linkage metric?

A

A linkage metric that, at each merge, joins the pair of clusters yielding the smallest increase in the total within-cluster sum of squares
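
The Ward merge cost for clusters A and B, with sizes n_A, n_B and centroids x̄_A, x̄_B (the standard formula, assumed rather than quoted from the cards):

$$ \Delta(A, B) = \frac{n_A n_B}{n_A + n_B} \, \lVert \bar{\mathbf{x}}_A - \bar{\mathbf{x}}_B \rVert^2 $$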

23
Q

Illustration of agglomerative methods of clustering?

A
  1. Assign each object to one cluster (N objects, N clusters).
  2. Calculate the distances (similarities) between all objects. This can be done using different methods of linkage (e.g. Single linkage, complete linkage, average linkage, centroid method, or Ward’s method).
  3. Find the closest (most similar) pair of clusters and combine them into one single cluster; therefore, we get one cluster less.
  4. Calculate the distances (similarities) between the new cluster and others.
  5. Repeat steps 3 and 4 until all objects are grouped together in one single cluster of size N (N − 1 iterations).
24
Q

What is hierarchical clustering using single linkage?

A

min(10, 20) = 10

The minimum distance over all possible pairs of objects, one from each cluster

25
Q

What’s the final outcome of a single linkage?

A

one big cluster

26
Q

What is hierarchical clustering using complete linkage?

A

max(10, 20) = 20

The maximum distance over all possible pairs of objects, one from each cluster

27
Q

What is hierarchical clustering using centroid linkage?

A

d(Cluster m, Cluster k)

The distance between the centroids of the two clusters

28
Q

What is hierarchical clustering using average linkage?

A

average(10, 20) = 15

The average distance over all possible pairs of objects, one from each cluster
Also known as group-average hierarchical clustering
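
A toy check of the three criteria on the pairwise distances used in these cards:

```python
import numpy as np

# Pairwise distances between objects in two different clusters
d = np.array([10.0, 20.0])

print(d.min())   # 10.0 -- single linkage
print(d.max())   # 20.0 -- complete linkage
print(d.mean())  # 15.0 -- average linkage
```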

29
Q

Effects of single linkage?

A

Good for detecting arbitrary-shaped clusters

Cannot detect overlapping clusters

30
Q

Effects of complete-linkage?

A

Good for detecting overlapping clusters

Not for detecting arbitrary-shaped clusters

31
Q

effects of centroid and average linkage?

A

Somewhere in between single linkage and complete linkage

32
Q

Practical issues of hierarchical clustering?

A

  • Works more efficiently with a pre-computed N-by-N proximity matrix
  • At least N^2 time and space complexity
  • With 1,000 records:
    - needs about 1 million time units to compute the proximities
    - needs about 1 million memory units to store the proximity values
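
A quick back-of-the-envelope check in Python (8-byte floats are an assumption; a real implementation could store only the upper triangle of the matrix):

```python
n = 1_000
entries = n * n             # 1,000,000 proximity values
bytes_needed = entries * 8  # float64 -> about 8 MB
print(entries, bytes_needed / 1e6, "MB")
```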