Chapter 4 Flashcards

1
Q

Why is clustering regarded as unsupervised learning?(1)

A

There are no labelled observations which can be used to learn about the relationship between observations and group labels.

2
Q

What are the properties of a dissimilarity measure, and where are the dissimilarities stored?(2)

A
Stored in the distance matrix D.
Properties:
1. $d_{ij} \ge 0$ for all (i, j);
2. $d_{ii} = 0$ for all i;
3. if $d_{ij} = 0$ then $x_i = x_j$.
In addition, if $x_i$ and $x_j$ are more “similar” than $x_k$ and $x_l$, then $d_{ij}$ should be smaller than $d_{kl}$. Many clustering algorithms also require that the dissimilarities are symmetric, i.e. $d_{ij} = d_{ji}$.
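
A minimal sketch (not from the notes) of building such a distance matrix with SciPy; the data and variable names are illustrative:

```python
# Build an n x n distance matrix D of pairwise Euclidean dissimilarities.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))                   # n = 6 observations, p = 3 variables

D = squareform(pdist(X, metric="euclidean"))  # symmetric matrix with zero diagonal

# Check the stated properties: d_ij >= 0, d_ii = 0, and symmetry d_ij = d_ji.
assert (D >= 0).all()
assert np.allclose(np.diag(D), 0)
assert np.allclose(D, D.T)
```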
3
Q

What is Euclidean distance? When may this be inappropriate to use?(2)

A

$d_{ij} = \left\{ \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right\}^{1/2}$

The Euclidean distance might not be appropriate when some variables vary on very different scales or are highly correlated. Standardising the variables helps with the differing scales; the Mahalanobis distance addresses both the differing scales and the correlation.
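
An illustrative sketch, with made-up data, of how differing scales distort Euclidean distance and how standardising each variable helps:

```python
# Euclidean distance before and after standardising the variables.
import numpy as np

# Two variables on very different scales (e.g. metres vs. grams).
X = np.array([[1.0, 5000.0],
              [1.2, 4000.0],
              [9.0, 4100.0]])

def euclid(a, b):
    # d_ij = {sum_k (x_ik - x_jk)^2}^(1/2)
    return np.sqrt(np.sum((a - b) ** 2))

print(euclid(X[0], X[1]), euclid(X[0], X[2]))   # dominated by the second variable

# Standardise each column to mean 0 and standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(euclid(Z[0], Z[1]), euclid(Z[0], Z[2]))   # both variables now contribute comparably
```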

4
Q

What is the Mahalanobis distance?(1)

A

$d_{ij} = \sqrt{(x_i - x_j)^T S^{-1} (x_i - x_j)}$

where S is the sample variance matrix of X.
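
A minimal NumPy sketch of this formula, assuming X is an n × p data matrix (names and data are illustrative):

```python
# Mahalanobis distance d_ij = sqrt((x_i - x_j)^T S^{-1} (x_i - x_j)),
# where S is the sample variance matrix of X.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))

S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse of the p x p sample variance matrix

def mahalanobis(xi, xj, S_inv):
    d = xi - xj
    return np.sqrt(d @ S_inv @ d)

print(mahalanobis(X[0], X[1], S_inv))
```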

5
Q

When is a correlation-based distance measure appropriate? How would you calculate this?(2)

A

When absolute differences aren’t important but the proportions/patterns across variables are. It is based on the correlation matrix between data vectors (note: not between variables). This matrix has n rows and n columns and its (i, j)-th entry, denoted $r_{ij}^{obs}$, is the sample correlation between $x_i$ and $x_j$. The distance is $d_{ij} = 1 - r_{ij}^{obs}$, so a small $d_{ij}$ denotes greater similarity between the pair of observations.
To compute this we need the correlations between observations, not variables, and so we need to transpose the data matrix.
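
An illustrative NumPy sketch; note that np.corrcoef correlates rows by default, which here plays the role of transposing the data matrix:

```python
# Correlation-based distance d_ij = 1 - r_ij^obs between observations (rows of X).
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 5))   # 10 observations, 5 variables

# np.corrcoef correlates *rows* by default, i.e. the observations here
# (the same role as transposing the data matrix before a column-wise correlation).
R_obs = np.corrcoef(X)         # 10 x 10 matrix of r_ij^obs
D = 1.0 - R_obs                # small d_ij <=> similar pattern across variables

assert np.allclose(np.diag(D), 0)
```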

6
Q

What is the Manhattan (taxi-cab) distance measure?(1)

A

$d_{ij} = \sum_{k=1}^{p} |x_{ik} - x_{jk}|$

Compared with Euclidean distance, this assigns less weight to large differences in one or more components, and hence is more robust to outliers.
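
A small illustrative sketch comparing the two distances on made-up vectors:

```python
# Manhattan (taxi-cab) distance d_ij = sum_k |x_ik - x_jk| versus Euclidean.
import numpy as np

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.5])

d_manhattan = np.sum(np.abs(xi - xj))           # 3 + 2 + 0.5 = 5.5
d_euclidean = np.sqrt(np.sum((xi - xj) ** 2))   # squaring inflates the largest component
print(d_manhattan, d_euclidean)
```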

7
Q

What is the k-means algorithm?(3)

A
  1. Initialisation: choose K “means” $m_1^{[0]}, m_2^{[0]}, \ldots, m_K^{[0]}$ randomly, e.g. by randomly choosing K of the n observations, without replacement.
  2. Iteration: repeat for j = 1, 2, … until a stopping criterion has been satisfied:
    (a) Reassignment: compute the cluster partitions $C_1^{[j]}, C_2^{[j]}, \ldots, C_K^{[j]}$ by going through each observation $x_i$ in turn and allocating i to $C_k^{[j]}$ if $m_k^{[j-1]}$ is the closest mean in terms of Euclidean distance, i.e. if $\|x_i - m_k^{[j-1]}\| \le \|x_i - m_l^{[j-1]}\|$ for all l.
    (b) Update: recompute each mean $m_k^{[j]}$ as the sample mean of the observations in $C_k^{[j]}$.
  A common stopping criterion is that the cluster allocations do not change between successive iterations.
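
A from-scratch sketch of these steps (illustrative only; the function and variable names are my own, and the stopping rule assumed is "allocations unchanged"):

```python
# Minimal K-means sketch following the steps above.
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # 1. Initialisation: choose K of the n observations (without replacement) as starting means.
    means = X[rng.choice(n, size=K, replace=False)].copy()
    labels = np.full(n, -1)
    for _ in range(max_iter):
        # 2(a) Reassignment: allocate each x_i to the closest mean in Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)  # n x K
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                               # stopping criterion: allocations unchanged
        labels = new_labels
        # 2(b) Update: recompute each mean as the average of the observations in its cluster.
        for k in range(K):
            if np.any(labels == k):
                means[k] = X[labels == k].mean(axis=0)
    return labels, means

# Example usage on synthetic data with two well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
labels, means = kmeans(X, K=2)
```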
8
Q

Is K-means guaranteed to reach a global minimum?(1)

A

No! It is only guaranteed to reach a local minimum of $SS_W$, not the global minimum. For this reason it is suggested that the algorithm is run several times from different initial means, and the solution with the smallest $SS_W$ is selected.
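
As a sketch, scikit-learn's KMeans can do these restarts itself via n_init, keeping the run with the smallest within-cluster sum of squares (its inertia_ attribute); the data here are synthetic:

```python
# Run K-means from several random initialisations and keep the solution
# with the smallest within-cluster sum of squares (SS_W).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.inertia_)   # SS_W of the best of the 10 runs
```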

9
Q

What is a way to visualise the K-means cluster structure?(1)

A

Plot the data on the first two principal components, with points labelled according to their cluster.
**Also do this for hierarchical clustering; it is useful for spotting chaining problems (one long, thin group will dominate the plot).
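
An illustrative sketch using scikit-learn and matplotlib (synthetic data; the standardisation step is an assumption, not prescribed by the card):

```python
# Visualise a K-means clustering on the first two principal components.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(3, 1, (40, 5))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

plt.scatter(scores[:, 0], scores[:, 1], c=labels)   # points labelled by cluster
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```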

10
Q

What are some practical implications of K-means?(2)

A
  • Choosing K may be difficult; one way around this is to plot $SS_W$ against K and look for a “kink” (elbow) where we appear to get diminishing returns (see the sketch after this list).
  • K-means is not scale invariant and favours the identification of hyperspherical clusters, even if the true clusters have shapes which are very different from this. The Mahalanobis transformation of the data can help with this.
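
A sketch of the elbow plot described in the first bullet, using scikit-learn's inertia_ as $SS_W$ on synthetic data:

```python
# Plot SS_W (inertia) against K and look for a "kink" (elbow).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1, (30, 2)) for c in (0, 5, 10)])   # three true groups

Ks = range(1, 9)
ssw = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in Ks]

plt.plot(list(Ks), ssw, marker="o")
plt.xlabel("K")
plt.ylabel("SS_W")
plt.show()
```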
11
Q

What advantages does hierarchical agglomerative clustering have over K-means?(2)

A
  • The number of clusters K does not need to be specified in advance.
  • It gives a more visual representation of the cluster structure through the construction of a dendrogram.

12
Q

What is the hierarchical agglomerative clustering algorithm?(3)

A
  1. Initialisation: start with n clusters, $C_1, C_2, \ldots, C_n$, with $C_i$ containing the single observation $x_i$. The distance matrix for the n clusters is simply the distance matrix for the n observations, i.e. the matrix of pairwise dissimilarities.
  2. Iteration: for i = n, n − 1, …, 2:
    (a) Fusion: find the minimum distance $d_{jk}$ between the i clusters. (If there is more than one minimum, pick one uniformly at random.) Fuse clusters $C_j$ and $C_k$ into a single cluster and remove the jth and kth rows and columns from the distance matrix. The distance $d_{jk}$ indicates the height in the dendrogram at which the fusion should be placed.
    (b) Recomputation: use the chosen linkage method to calculate the distances between the new cluster and the remaining clusters. Record the i − 1 distances in a new row and column added to the distance matrix.
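
An illustrative sketch of the same procedure using SciPy's hierarchical clustering routines (synthetic data; complete linkage chosen arbitrarily):

```python
# Agglomerative clustering via SciPy: compute pairwise dissimilarities, fuse
# clusters according to a linkage method, and draw the dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

d = pdist(X, metric="euclidean")    # pairwise dissimilarities
Z = linkage(d, method="complete")   # fusion order and fusion distances

dendrogram(Z)                       # heights correspond to the fusion distances d_jk
plt.show()
```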
13
Q

What are linkage methods? List the 3 most common types.(4)

A

-In agglomerative clustering you start with each observation in its own cluster and successively fuse the pair of clusters with the smallest distance; a linkage method defines the distance between two clusters and hence determines which pair gets fused. Three common methods are:
-single: the minimum distance between an observation in cluster A and an observation in cluster B.
-complete: the maximum distance between an observation in cluster A and an observation in cluster B.
-average: the mean distance between an observation in cluster A and an observation in cluster B.
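
A small sketch computing the three linkage distances directly from the pairwise distances between two illustrative clusters:

```python
# The three linkage rules as summaries of the pairwise distances between
# an observation in cluster A and an observation in cluster B.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
A = rng.normal(0, 1, (5, 2))   # observations in cluster A
B = rng.normal(3, 1, (4, 2))   # observations in cluster B

between = cdist(A, B)          # all pairwise distances between A and B

single   = between.min()       # minimum pairwise distance
complete = between.max()       # maximum pairwise distance
average  = between.mean()      # mean pairwise distance
print(single, complete, average)
```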

14
Q

Which of the linkage methods is not invariant under a monotone transformation?(1)

A

Note that single and complete linkage are invariant under any monotone (i.e. order preserving) transformation of the original distances between observations. This is not true of average linkage.

15
Q

What does cutting the dendrogram do?(1)

A

Cutting the dendrogram at a chosen height partitions the observations into non-overlapping clusters.
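
An illustrative SciPy sketch: fcluster cuts the tree either to a chosen number of clusters or at a chosen height (both thresholds here are arbitrary):

```python
# Cut the dendrogram to obtain a flat partition into non-overlapping clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

Z = linkage(X, method="complete")

labels_k = fcluster(Z, t=2, criterion="maxclust")    # cut to give exactly 2 clusters
labels_h = fcluster(Z, t=3.0, criterion="distance")  # or cut at a chosen height
print(labels_k, labels_h)
```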

16
Q

What are some practical implications of hierarchical clustering?(2)

A
  • Choice of cut: a heuristic is to look for a large change in the fusion distance (see the sketch after this list).
  • Choice of dissimilarity measure and linkage method: the resulting clustering can be sensitive to both choices, e.g. chaining problems with single linkage.
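
A sketch of the cut heuristic from the first bullet: inspect the fusion heights stored in the linkage matrix and look for the largest jump (synthetic data):

```python
# Heuristic choice of cut: look for a large jump in the fusion distances,
# which SciPy records in the third column of the linkage matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (15, 2)), rng.normal(6, 1, (15, 2))])

Z = linkage(X, method="complete")
heights = Z[:, 2]                   # fusion distances, in increasing order
jumps = np.diff(heights)
print(heights[np.argmax(jumps)])    # cut just above this height
```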
17
Q

What is chaining?(1)

A

Common with single linkage, whereby nearby observations are fused one by one, leading to long, thin clusters.

18
Q

Which linkage method is the best?(1)

A

Complete linkage tends to give the most balanced dendrograms and so is generally preferred; average linkage interpolates between complete and single; single linkage can suffer from chaining. However, it is always useful to try different linkage methods (and dissimilarity measures) before deciding, as the hierarchical structure obtained is partly an artefact of the clustering method.