Block 2: Distance-based methods Flashcards
Steps from Euclidean distances E to configurations X
- From E to B: put the centroid Xbar = 1^T X / n at the origin by double centring, B = -1/2 (I - 11^T/n) E (I - 11^T/n)
- From B to X: construct Y whose ith column is f(i) = sqrt(λi) e(i), where λi and e(i) are the eigenvalues and eigenvectors of B = sum (from i=1 to n) λi e(i) e(i)^T (see the sketch below)
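A minimal R sketch of these two steps, assuming E holds *squared* Euclidean distances (as in em,l below); the data are toy and the names illustrative:

```r
set.seed(1)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)   # toy configuration
E <- as.matrix(dist(X))^2                     # squared Euclidean distances

n <- nrow(E)
H <- diag(n) - matrix(1, n, n) / n            # centring matrix I - 11^T/n
B <- -0.5 * H %*% E %*% H                     # from E to B

eig <- eigen(B, symmetric = TRUE)             # B = sum_i lambda_i e(i) e(i)^T
Y <- eig$vectors[, 1:2] %*% diag(sqrt(eig$values[1:2]))  # columns sqrt(lambda_i) e(i)

max(abs(dist(Y) - dist(X)))                   # ~0: the distances are recovered
```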
Steps from configurations X to distances E, and what information do we lose?
- let em,l = sum (from v=1 to p) (Xm,v - Xl,v)^2 = X(m)^T X(m) + X(l)^T X(l) - 2 X(m)^T X(l) = bm,m + bl,l - 2 bm,l, where X(m) is the mth row of X and bm,l = X(m)^T X(l)
- from X to B we lose orientation information: replacing X by Y = XP leaves B unchanged, since YY^T = (XP)(XP)^T = X P P^T X^T = XX^T = B when P is an orthogonal rotation matrix
- from B to E we lose position information: replacing X(m) by W(m) = X(m) - μ gives the same E, even though W(m)^T W(l) = X(m)^T X(l) - X(m)^T μ - μ^T X(l) + μ^T μ differs from bm,l (checked in code below)
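A quick R check of both invariances on assumed toy data: rotating by an orthogonal P and shifting by μ leave E unchanged.

```r
set.seed(2)
X <- matrix(rnorm(12), nrow = 6, ncol = 2)
theta <- pi / 5
P <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)  # rotation
mu <- c(3, -1)                                                         # translation

W <- sweep(X %*% P, 2, mu, FUN = "+")   # rotate, then translate
max(abs(dist(W) - dist(X)))             # ~0: E keeps neither orientation nor position
```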
What is classical multidimensional scaling? What is the impact of the eigenvalues in CMDS?
- The method to obtain Y from E
- A test for Euclidean-ness (negative eigenvalues correspond to the non-Euclidean nature of data)
- A method for estimating dimensionality:
n’’ ≈ the number of eigenvalues clearly bigger than the rest
or n’’ such that sum (from i=1 to n’’) λi ≈ tr(B)
or reject every λi with |λi| <= |λn|, i.e. no bigger in magnitude than the most negative eigenvalue
How to run multidimensional scaling?
can use the R function cmdscale(E, k = maximum dimension, eig = TRUE), where the distance object E needs to be symmetrised first (example below)
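For example, with the built-in eurodist road distances (non-Euclidean, so some eigenvalues come out negative):

```r
mds <- cmdscale(eurodist, k = 2, eig = TRUE)
head(mds$points)                       # the fitted configuration Y
range(mds$eig)                         # negative eigenvalues flag non-Euclidean data
sum(mds$eig[1:2]) / sum(abs(mds$eig))  # rough share captured by 2 dimensions
```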
Give properties of a metric d(x,y)
- d(x,y) >= 0, with d(x,y) = 0 if x = y (the converse, x = y whenever d(x,y) = 0, does not always hold)
- symmetric: d(x,y) = d(y,x)
- triangle inequality: d(x,y) + d(y,z) >= d(x,z)
keep in mind: if {dα} is a family of metrics, then the sum Σα dα is again a metric
Define Hamming distance
The number of mismatches: d(x,y) = sum (from i=1 to n) di(x,y) = b + c, where di(x,y) = 1 if xi ≠ yi and 0 otherwise
Define Jaccard distance
dJ(x,y) = (b+c) / (a+b+c), where a = sum 1{xi = yi = 1}, b = sum 1{xi = 1, yi = 0}, c = sum 1{xi = 0, yi = 1} and d = sum 1{xi = yi = 0} (example below)
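Both counts-based distances take one line of R each; the binary vectors here are toy data:

```r
x <- c(1, 0, 1, 1, 0, 0)
y <- c(1, 1, 0, 1, 0, 1)

a <- sum(x == 1 & y == 1)   # joint presences
b <- sum(x == 1 & y == 0)
c <- sum(x == 0 & y == 1)
d <- sum(x == 0 & y == 0)   # joint absences

b + c                       # Hamming distance, equal to sum(x != y)
(b + c) / (a + b + c)       # Jaccard distance
```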
Give 5 examples of other dissimilarities/distances
- simple matching coefficient: (b+c) / (a+b+c+d)
- Manhattan distance: L1 norm, sum (from i=1 to p) |xi - yi|
- Canberra distance: sum (from i=1 to p) |xi - yi| / (|xi| + |yi|)
- Maximum: L∞ norm, max over i of |xi - yi| (see the dist() example below)
- Gower’s similarity coefficient: a weighted combination of variable-wise similarities, which can handle mixed continuous and categorical data
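The first three are built into R's dist(); Gower needs an extra package (assuming cluster is installed):

```r
x <- matrix(rnorm(20), nrow = 5)   # 5 observations, 4 variables

dist(x, method = "manhattan")  # sum_i |x_i - y_i|
dist(x, method = "canberra")   # sum_i |x_i - y_i| / (|x_i| + |y_i|)
dist(x, method = "maximum")    # max_i |x_i - y_i| (L-infinity)

# library(cluster); daisy(mixed_data, metric = "gower")  # Gower, mixed data types
```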
What is the stress function?
The stress function measures the degree of agreement between the dissimilarities {δm,l} and the Euclidean distances {dm,l} of a configuration X
. monotone (isotonic) regression gives fitted values {d̂m,l} in the same rank order as {δm,l}
. S(X) = sqrt(S*/T*), where S* = sum (over m<l) (dm,l - d̂m,l)^2 and T* = sum (over m<l) dm,l^2
What are the Miles algorithm and Young’s boundary search algorithm?
Algorithms for the monotone regression step; both produce a non-decreasing step function (each step value is the mean of the values in that “block”); see the isoreg() sketch below
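Neither algorithm is in base R, but isoreg() solves the same monotone-regression problem (pool-adjacent-violators); a toy sketch where out-of-order pairs get pooled:

```r
delta <- 1:6                      # dissimilarities, already in increasing order
d     <- c(1, 3, 2, 4, 6.5, 6)    # distances of the current configuration

fit <- isoreg(delta, d)           # pool-adjacent-violators
fit$yf                            # non-decreasing step function of block means
```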
How do we choose a configuration X from non-Euclidean dissimilarities?
- compute stress function
- minimise it over all possible configurations
How to find optimal configuration from stress function?
- Initial configuration: random, L-shaped, reduced from a higher dimension K to a lower one, or the classical scaling solution
- Optimisation using a gradient method with ∇S:
∂S/∂xi,k = (S/2) ((1/S*) (∂S*/∂xi,k) - (1/T*) (∂T*/∂xi,k))
with ∂T*/∂xi,k and ∂S*/∂xi,k given in the lecture notes, using ∂xm,k/∂xi,k = 1 if i = m and 0 otherwise; travel in the direction of -∇S to find the optimal X (see the sketch below)
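In practice this minimisation is available off the shelf; a sketch using MASS::isoMDS, which implements Kruskal's non-metric scaling and starts from the classical scaling solution by default:

```r
library(MASS)

fit <- isoMDS(eurodist, k = 2)   # iterates until the stress stops decreasing
fit$stress                       # final stress, reported as a percentage
head(fit$points)                 # the optimised configuration X
```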
Mention an advantage of ordinal scaling
It can cope with missing data
Steps of K-means clustering
- initialise with k centroids mj
- for each observation Xi, assign it to the closest centroid (by Euclidean distance), then update each centroid as the mean of its assigned observations
- stop when the cluster assignments no longer change (see the kmeans() example below)
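Base R's kmeans() implements exactly these steps; the two-cluster data here are toy:

```r
set.seed(3)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))

km <- kmeans(x, centers = 2)  # k = 2 centroids, random initialisation
km$centers                    # final centroids m_j
table(km$cluster)             # assignments after convergence
```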
What are self-organising maps (SOMs)?
Similar to k-means, but the prototypes lie on a grid: each observation is assigned to its closest prototype, and both that prototype and its grid neighbours are updated (see the sketch below)
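A sketch using the kohonen package (an assumption: it is not part of base R), with prototypes on a 3 x 3 grid:

```r
library(kohonen)   # install.packages("kohonen") if needed

set.seed(4)
x <- scale(matrix(rnorm(200), ncol = 4))   # 50 standardised observations

fit <- som(x, grid = somgrid(xdim = 3, ydim = 3, topo = "rectangular"))
fit$codes                                  # prototype ("codebook") vectors
```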