Unit 4 - Fundamentals of Clustering Flashcards

1
Q

What is clustering?

A

Discover natural groupings in data

Clustering does not attempt to reveal relationships among the variables (as association analysis does). Instead, clustering aims to group objects or records into clusters that exhibit a high degree of internal (within-cluster) similarity and external (between-cluster) dissimilarity.

2
Q

What are some clustering applications?

A
  • Market Segmentation
    Divide customers into groups so that a mixture of selling strategies can be formulated to maximize company revenue
  • Credit Scoring
    A bank or credit card company may apply clustering to identify potential fraud cases
  • Anomaly Detection – spotting things that are not in the normal clusters
    An underwriter attempts to single out the accident-prone clients who are likely to file claims after being insured
  • Image Segmentation
    An image recognition system tries to recognize different parts of an image
3
Q

What are some examples of clustering?

A
  • A capitalist endeavours to spot potential hotels for takeover, based on their financial health, market shares, location and quality of service.
  • An underwriter attempts to single out the accident-prone clients who are likely to file claims after being insured.
  • A football club aims to shortlist a pool of potential players, according to their experience, age, transfer fees, market value, goal-scoring rate and fitness.
  • A higher learning committee plans to categorise colleges in terms of student-faculty ratio, graduation rate, research ranking and alumni donation.
  • An international agency sets out to segment countries by means of indicators like GDP per capita, literacy rate, population density, life expectancy and standards of living.
4
Q

What is a cluster?

A

A set whose members are similar to one another but dissimilar to objects outside the cluster.

5
Q

Why is clustering different from association analysis?

A

Clustering finds relationships between records (objects), whereas association analysis finds relationships between fields (variables)

6
Q

Why is clustering different from classification?

A
Clustering does not train with a class label, unlike building classifiers in supervised learning
7
Q

Why is clustering different from factor analysis?

A

Factor analysis groups variables, whereas clustering groups objects

8
Q

Steps in cluster analysis?

A

1) Define a measure of similarity
2) Choose an appropriate clustering algorithm
3) Set the number of clusters
4) Generate the clusters
5) Interpret and profile the clusters with the aim of illustrating their differences
6) Validate the clusters
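
A minimal end-to-end sketch of these steps in Python, assuming scikit-learn; the synthetic data and the choice of k = 3 are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Illustrative data: 200 records, 4 clustering variables
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# 1) measure of similarity: k-means uses Euclidean distance,
#    so standardise the variables first
X_std = StandardScaler().fit_transform(X)

# 2)-4) choose an algorithm, set the number of clusters, generate them
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)

# 5) profile the clusters, e.g. via their centroids
print(model.cluster_centers_)

# 6) validate, e.g. with the average silhouette
print(silhouette_score(X_std, model.labels_))
```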

9
Q

What are the two types of clustering?

A

1) Hierarchical (agglomerative, "bottom-up": repeatedly merge the closest clusters)

2) Partitional ("top-down": divide the data directly into k clusters, e.g. k-means)

10
Q

Steps of hierarchical clustering?

A

1) Compute the proximity among the data objects
2) Repeatedly merge the closest data objects/clusters
3) Update the proximity between the new and the original clusters; repeat steps 2 and 3 until all objects are grouped to form one final cluster
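
A short sketch of these steps with SciPy; the sample data and the single-linkage choice are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: 10 objects, 2 variables
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))

# Compute proximities and repeatedly merge the closest clusters;
# each row of Z records one merge (clusters joined, merge distance)
Z = linkage(X, method="single")

# Cut the hierarchy to recover, say, 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```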

11
Q

Steps of partitioning clustering?

A

1) Select k seed points as initial centroids
2) Form k clusters by assigning each object to its closest centroid
3) Update the centroid of each cluster; repeat steps 2 and 3 until the stopping criterion is fulfilled

Also known as k-means
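
A bare-bones sketch of these steps in Python; Euclidean distance and a "centroids stop moving" stopping rule are assumptions, and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) select k seed points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2) assign each object to its closest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) update each centroid to the mean of its assigned objects
        updated = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # stopping criterion: centroids no longer move
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# Usage on illustrative data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
labels, centroids = kmeans(X, k=3)
```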

12
Q

What is a dendrogram?

A

A diagram that illustrates how the objects are grouped and at what level of distance each grouping occurs
It shows, step by step, how the clusters are formed
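
A dendrogram can be drawn from a SciPy linkage matrix; matplotlib and the toy data below are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
Z = linkage(X, method="single")

# Each U-shaped link joins two clusters at the height (distance)
# where they were merged
dendrogram(Z)
plt.xlabel("object")
plt.ylabel("merge distance")
plt.show()
```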

13
Q

How reliant is clustering on the analyst? (7 points)

A
  • Exploratory in nature
  • In hierarchical clustering, no grouping structure is assumed
  • Clustering solution relies heavily on the analyst’s knowledge to label the clusters in a meaningful way
  • Different solutions may be obtained with different stopping rules
  • Usually, more than one competing solution is evaluated before the clustering solution is determined
  • The interpretation and labelling of a cluster is a subjective or even creative affair
  • In some instances, there may be more than one interpretation for the identified clusters
14
Q

What are proximity measures?

A

Measures of the closeness between objects, expressed either as a distance (dissimilarity) or as a similarity

15
Q

What are the commonly used proximity measures?

A
  • Euclidean distance
  • Mahalanobis distance
  • Minkowski distance
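
All three can be computed with SciPy; the two sample points and the covariance matrix below are made up for illustration:

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis, minkowski

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

print(euclidean(x, y))       # 5.0
print(minkowski(x, y, p=2))  # 5.0 -- Euclidean is Minkowski with M = 2
print(minkowski(x, y, p=1))  # 7.0 -- Manhattan distance

# Mahalanobis needs the inverse covariance matrix of the data
# (this S is an illustrative assumption)
S = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(mahalanobis(x, y, np.linalg.inv(S)))
```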
16
Q

Difference between Mahalanobis distance and Euclidean distance?

A

Two points A and B may be equally distant from X under Euclidean distance, but this can be misleading once the density (dispersion) of the data is taken into account

  • Mahalanobis distance takes dispersion into account
  • It is also useful for anomaly detection, especially for non-spherical distributions such as an ellipsoidal shape
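
In symbols, with S the sample covariance matrix (the standard definition, not quoted from the cards):

$$ d_M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^{\top} S^{-1} (\mathbf{x} - \mathbf{y})} $$

When S is the identity matrix, this reduces to the Euclidean distance.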
17
Q

Difference between Euclidean and Minkowski?

A

Euclidean distance is a special case of Minkowski distance with M=2
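
The general Minkowski form, for points with p coordinates (standard definition, assumed here):

$$ d(\mathbf{x}, \mathbf{y}) = \Big( \sum_{i=1}^{p} |x_i - y_i|^{M} \Big)^{1/M} $$

M = 1 gives the Manhattan distance; M = 2 gives the Euclidean distance.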

18
Q

What does clustering criteria Root-mean-square standard deviation do?

A

It pools the variation of all the clustering variables that form a cluster at a given stage of the clustering process

19
Q

The formula of Root-Mean-Square standard deviation?

A

The square root of the pooled sum of squares for all clustering variables divided by the pooled degrees of freedom for all the clustering variables

small RMSSTD → small variations → homogeneous cluster
large RMSSTD → large variations → heterogeneous cluster
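
In symbols, pooling over the p clustering variables, with n_j observations on variable j (a standard form, assumed to match the course's):

$$ \text{RMSSTD} = \sqrt{\frac{\sum_{j=1}^{p} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2}{\sum_{j=1}^{p} (n_j - 1)}} $$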

20
Q

Is 3-cluster solution or 4-cluster solution preferred?

A

The 3-cluster solution, as it has the smaller RMSSTD (its clusters are more homogeneous)

21
Q

What are other rules to determine the number of clusters?

A

Akaike’s information criterion (AIC)
Bayesian information criterion (BIC)
Average silhouette
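
A sketch of comparing candidate values of k by average silhouette with scikit-learn; the data and the range of k are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3))

# Fit k-means for several candidate k; prefer the k with the
# largest average silhouette
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```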

22
Q

What is Ward’s linkage metric?

A

A linkage metric that, at each merge, joins the pair of clusters yielding the smallest increase in the total within-cluster sum of squares
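
The Ward merge cost for clusters A and B, with sizes n_A, n_B and centroids x̄_A, x̄_B (the standard formula, assumed rather than quoted from the cards):

$$ \Delta(A, B) = \frac{n_A n_B}{n_A + n_B} \, \lVert \bar{\mathbf{x}}_A - \bar{\mathbf{x}}_B \rVert^2 $$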

23
Q

Illustration of agglomerative methods of clustering?

A
  1. Assign each object to one cluster (N objects, N clusters).
  2. Calculate the distances (similarities) between all objects. This can be done using different methods of linkage (e.g. Single linkage, complete linkage, average linkage, centroid method, or Ward’s method).
  3. Find the closest (most similar) pair of clusters and combine them into one single cluster; therefore, we get one cluster less.
  4. Calculate the distances (similarities) between the new cluster and others.
  5. Repeat steps 3 and 4 until all objects are grouped together in one single cluster of size N (N − 1 iterations).
24
Q

What is hierarchical clustering using single linkage?

A

min(10, 20) = 10

The minimum distance over all possible pairs of objects, one from each cluster

25
Q

What’s the final outcome of a single linkage?

A

one big cluster

26
Q

What is hierarchical clustering using complete linkage?

A

max(10, 20) = 20

The maximum distance over all possible pairs of objects, one from each cluster

27
Q

What is hierarchical clustering using centroid linkage?

A

d(Cluster m, Cluster k)

The distance between the centroids of the two clusters

28
Q

What is hierarchical clustering using average linkage?

A

average(10, 20) = 15

The average distance over all possible pairs of objects, one from each cluster
Also known as group-average hierarchical clustering
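
A toy check of the three criteria on the pairwise distances used in these cards:

```python
import numpy as np

# Pairwise distances between objects in two different clusters
d = np.array([10.0, 20.0])

print(d.min())   # 10.0 -- single linkage
print(d.max())   # 20.0 -- complete linkage
print(d.mean())  # 15.0 -- average linkage
```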

29
Q

Effects of single linkage?

A

Good for detecting arbitrary-shaped clusters

Cannot detect overlapping clusters

30
Q

Effects of complete-linkage?

A

Good for detecting overlapping clusters

Not for detecting arbitrary-shaped clusters

31
Q

effects of centroid and average linkage?

A

Somewhere in between single linkage and complete linkage

32
Q

Practical issues of hierarchical clustering?

A

  • Works more efficiently with a pre-computed N-by-N proximity matrix
  • At least N^2 time and space complexity
  • With 1,000 records:
    - needs about 1 million time units to compute the proximities
    - needs about 1 million memory units to store the proximity values
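
A quick back-of-the-envelope check in Python (8-byte floats are an assumption; a real implementation could store only the upper triangle of the matrix):

```python
n = 1_000
entries = n * n             # 1,000,000 proximity values
bytes_needed = entries * 8  # float64 -> about 8 MB
print(entries, bytes_needed / 1e6, "MB")
```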