Cluster Analysis Flashcards

1
Q

Is cluster analysis a supervised or an unsupervised learning method?

A

Unsupervised learning method

2
Q

Contrast cluster analysis with PCA

A

PCA: seeks a low-dimensional representation of the observations that explains most of the variance
Cluster Analysis: seeks to partition the observations into a small number of groups of similar observations

3
Q

Give examples of application of cluster analysis

A

- Marketing
- Medical
- Actuarial modeling

4
Q

Give two clustering methods

A

- K-means
- Hierarchical

5
Q

Define K-means clustering

A

Group the observations in a data set into K disjoint clusters in which the observations are relatively homogeneous.
We choose the number of clusters, K, in advance. The clusters are exhaustive (every observation belongs to a cluster) and mutually exclusive.
The clusters are selected to minimize the total dissimilarity between points within each cluster.

6
Q

What is the centroid of a cluster and how can you compute it?

A

Point whose coordinates are the means of the coordinates of the cluster
Formula (in two dimensions): ( Σxᵢ/n , Σyᵢ/n )
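A minimal sketch in NumPy (the cards don't name a language, so Python is an assumption): the centroid is just the column-wise mean of the cluster's points.

```python
import numpy as np

# A small hypothetical cluster of 2-D points.
cluster = np.array([[1.0, 2.0],
                    [3.0, 6.0],
                    [5.0, 4.0]])

# The centroid's coordinates are the means of the points' coordinates.
centroid = cluster.mean(axis=0)
print(centroid)  # [3. 4.]
```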

7
Q

What is the algorithm for K-means clustering?

A
  1. Split the observations arbitrarily into K clusters
  2. Calculate the centroids for each cluster
  3. Create new clusters by assigning each point to the nearest centroid
  4. Repeat steps 2 and 3 until the clusters don’t change
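The four steps above can be sketched directly in NumPy (an assumption; any language would do). The data and seed are hypothetical, and the empty-cluster guard is an implementation detail the cards don't cover.

```python
import numpy as np

def k_means(X, K, rng):
    # Step 1: split the observations arbitrarily into K clusters.
    labels = rng.integers(0, K, size=len(X))
    while True:
        # Step 2: calculate the centroid of each cluster
        # (reseeding a cluster from a random point if it ever empties).
        centroids = np.array([X[labels == k].mean(axis=0)
                              if np.any(labels == k)
                              else X[rng.integers(len(X))]
                              for k in range(K)])
        # Step 3: reassign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the clusters no longer change.
        if np.array_equal(new_labels, labels):
            return labels, centroids
        labels = new_labels

# Two hypothetical well-separated groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels, centroids = k_means(X, K=2, rng=rng)
```

At convergence, no point is closer to another cluster's centroid than to its own, which is exactly the stopping condition of step 4.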
8
Q

What is the dissimilarity function for k-means clustering?

A

Twice the sum of squared Euclidean distances of the points from the cluster centroid.

Assigning points to the closest centroid can only reduce the total dissimilarity.
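A small numeric check of that claim, using NumPy (an assumption) and deliberately bad starting clusters:

```python
import numpy as np

# Hypothetical points and a poor initial assignment that mixes the two
# obvious groups across both clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 1, 0, 1])

def total_sq_dist(X, labels, K=2):
    # Sum of squared Euclidean distances of points to their cluster centroid.
    total = 0.0
    for k in range(K):
        pts = X[labels == k]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

before = total_sq_dist(X, labels)
# Reassign each point to its nearest centroid.
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
after = total_sq_dist(X, dists.argmin(axis=1))
print(before, after)  # 50.0 1.0 — the total dissimilarity drops
```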

9
Q

Define hierarchical clustering

A

Consists of a series of fusions of observations that results in nested clusters: bigger clusters contain smaller clusters, which contain smaller clusters still.
It does not require specifying the number of clusters in advance.

10
Q

Describe bottom-up / agglomerative clustering

A

The algorithm starts with n clusters (1 for each observation) and iteratively fuses the two most similar clusters together until all points are in one cluster.

11
Q

How do you decide how many clusters should be used in bottom-up clustering?

A

A dendrogram is produced. The number of clusters is determined by deciding at what height to cut the graph.
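A minimal sketch of cutting the dendrogram at a chosen height, assuming SciPy is available (the cards don't name a library); the data and cut height are hypothetical:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two hypothetical tight, well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (5, 2)), rng.normal(4, 0.2, (5, 2))])

# Bottom-up clustering; Z records each fusion and the height at which it occurs.
Z = linkage(X, method="complete")

# "Cutting" the dendrogram at height 2 leaves the two groups unmerged.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)
```

Cutting higher merges more clusters together; cutting at height 0 leaves every observation in its own cluster.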

12
Q

What is a dissimilarity measure?

A

A formula that measures how different two points are.

13
Q

What is the most common dissimilarity measure?

A

Euclidean distance (square root of the sum of square differences between coordinates)
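Spelled out in NumPy (an assumption) for two hypothetical points:

```python
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Square root of the sum of squared coordinate differences.
d = np.sqrt(((p - q) ** 2).sum())
print(d)  # 5.0
```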

14
Q

In what type of cluster analysis do we square the Euclidean distance?

A

K-means only

15
Q

Why can’t we square the Euclidean distance in hierarchical clustering?

A

In hierarchical clustering, the dendrogram plots each fusion at a height equal to the dissimilarity measure. We don’t want to distort the scale of the graph by squaring the distances.

16
Q

How does the fusion algorithm work?

A

At each iteration, we compare every pair of clusters. If there are K clusters, we make K(K − 1)/2 comparisons.
We select the pair of clusters with the smallest dissimilarity measure and fuse them.
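One fusion step can be sketched in plain Python/NumPy (an assumption; complete linkage and the three points are hypothetical):

```python
import numpy as np
from itertools import combinations

# Hypothetical points; each starts as its own cluster.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
clusters = [[0], [1], [2]]  # each cluster holds indices into `points`

def linkage_dist(a, b):
    # Complete linkage: maximum pairwise Euclidean distance between clusters.
    return max(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

# Compare every pair of clusters: K(K-1)/2 comparisons for K clusters.
best = min(combinations(range(len(clusters)), 2),
           key=lambda p: linkage_dist(clusters[p[0]], clusters[p[1]]))

# Fuse the pair with the smallest dissimilarity.
merged = clusters[best[0]] + clusters[best[1]]
print(best, merged)  # (0, 1) [0, 1] — the two nearest clusters fuse
```

Repeating this step until one cluster remains produces the full dendrogram.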

17
Q

A dissimilarity between two points is defined by the Euclidean distance (or the square of the Euclidean distance). How is the dissimilarity between two clusters defined?

A

The dissimilarity between two clusters is called linkage.

18
Q

What are the different types of linkage?

A
  1. Complete
  2. Single
  3. Average
  4. Centroid
19
Q

Define complete linkage

A

Calculate the dissimilarity between every point of cluster A and every point of cluster B (if there are a points in cluster A and b points in cluster B, this takes a×b calculations). The dissimilarity is the maximum of these numbers.

20
Q

Define single linkage

A

Calculate the dissimilarity between every point of cluster A and every point of cluster B (if there are a points in cluster A and b points in cluster B, this takes a×b calculations). The dissimilarity is the minimum of these numbers.
This linkage tends to produce trailing clusters (clusters built by fusing one point at a time into a single growing cluster).

21
Q

Define average linkage

A

Calculate the dissimilarity between every point of cluster A and every point of cluster B (if there are a points in cluster A and b points in cluster B, this takes a×b calculations). The dissimilarity is the average of these numbers.

22
Q

Define centroid linkage

A

Calculate the centroid of each cluster and use the dissimilarity between the centroids.
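All four linkages can be computed from the a×b pairwise distances between two clusters; a NumPy sketch on two hypothetical clusters (the language choice is an assumption):

```python
import numpy as np

# Two hypothetical clusters of 2-D points.
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

# All a*b pairwise Euclidean distances between the clusters.
pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

complete = pair.max()   # maximum pairwise distance
single = pair.min()     # minimum pairwise distance
average = pair.mean()   # average pairwise distance
# Centroid linkage: distance between the two cluster centroids.
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

# single <= average <= complete always; centroid need not obey that ordering,
# which is what makes inversions possible.
print(single, average, centroid, complete)
```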

23
Q

What is the disadvantage of centroid linkage?

A

Can lead to inversions: a later fusion occurs at a height lower than an earlier fusion, i.e. the dissimilarity of a later fusion is lower than the dissimilarity of an earlier fusion involving the same points.

24
Q

True or false: the first link is the same regardless of linkage

A

True

25
Q

Why does complete linkage usually prefer to fuse smaller cluster together first?

A

The more points a cluster contains, the higher its maximum pairwise dissimilarity tends to be, and fusion selects the pair with the smallest linkage value.

26
Q

Why does single linkage usually prefer to fuse groups with greater number of observations first?

A

The opposite of complete linkage: the more observations there are in a cluster, the lower the minimum pairwise dissimilarity tends to be.

27
Q

Does average linkage have a bias towards linking smaller or bigger groups first?

A

No

28
Q

What is the issue in hierarchical clustering?

A

It assumes that there is a hierarchy in the data, i.e. that the true groups are nested. If the groups are unrelated, this may not be a valid assumption, and K-means clustering will be superior in those cases.

29
Q

What is another dissimilarity measure that can be used in hierarchical clustering?

A

Correlation

30
Q

Why does it take at least 3 features to use correlation as a dissimilarity measure?

A

With only two features, any observation is perfectly correlated with any other observation (any 2 numbers are perfectly correlated with any other 2 numbers), so correlation cannot distinguish between points.
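A quick NumPy check (an assumption; the two observations are arbitrary):

```python
import numpy as np

# Two hypothetical observations, each with only 2 features.
x = np.array([1.0, 7.0])
y = np.array([4.0, 5.0])

# With 2 features, the correlation between any two observations is always +1
# or -1: two points always lie exactly on a line.
r = np.corrcoef(x, y)[0, 1]
print(abs(r))  # 1.0 (up to floating-point rounding)
```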

31
Q

Scale of variable is important. What action can be taken to ensure scale is considered?

A

One may standardize all variables to have a standard deviation of 1.
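In NumPy (an assumption; the data and scales are hypothetical), standardizing is a single division by the per-column standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data in which the second feature has a much larger scale,
# so it would dominate any distance-based clustering.
X = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1000, 100)])

# Standardize every variable to standard deviation 1 before clustering.
X_std = X / X.std(axis=0)
print(X_std.std(axis=0))  # both columns now have standard deviation 1
```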

32
Q

Is clustering robust?

A

No. A small deviation in data can result in different clusters.

33
Q

True or false: one may perform clustering analysis on PCA score vectors instead of on the original data.

A

True
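A minimal sketch of producing PCA score vectors for clustering, using NumPy's SVD (an assumption; the data, sizes, and number of retained components are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))   # hypothetical data: 50 observations, 5 features

# PCA score vectors via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * s[:2]      # scores on the first 2 principal components

# `scores` (50 x 2) can now be fed to K-means or hierarchical clustering
# in place of the original 50 x 5 data.
print(scores.shape)  # (50, 2)
```

Clustering the scores can reduce noise and dimensionality before the distance computations.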