Block 2: Distance-based methods Flashcards
Steps from Euclidean distances E to configurations X
- From E to B: put the centroid Xbar = 1^T X / n at the origin by double centring, B = -1/2 (I - 11^T/n) E (I - 11^T/n)
- From B to X: construct Y whose ith column is f(i) = sqrt(λi) e(i), where λi and e(i) are the eigenvalues and eigenvectors of B = sum (from i=1 to n) λi e(i) e(i)^T (see the sketch below)
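A minimal R sketch of these two steps, assuming E holds *squared* Euclidean distances (as in em,l below); the data are toy and the names illustrative:

```r
set.seed(1)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)   # toy configuration
E <- as.matrix(dist(X))^2                     # squared Euclidean distances

n <- nrow(E)
H <- diag(n) - matrix(1, n, n) / n            # centring matrix I - 11^T/n
B <- -0.5 * H %*% E %*% H                     # from E to B

eig <- eigen(B, symmetric = TRUE)             # B = sum_i lambda_i e(i) e(i)^T
Y <- eig$vectors[, 1:2] %*% diag(sqrt(eig$values[1:2]))  # columns sqrt(lambda_i) e(i)

max(abs(dist(Y) - dist(X)))                   # ~0: the distances are recovered
```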
Steps from configurations X to distances E, and what information do we lose?
- let em,l = sum (from v=1 to p) (Xm,v - Xl,v)^2 = X(m)^T X(m) + X(l)^T X(l) - 2 X(m)^T X(l) = bm,m + bl,l - 2 bm,l, where X(m) is the mth row of X and bm,l = X(m)^T X(l)
- from X to B we lose orientation information: replacing X by Y = XP leaves B unchanged, since YY^T = (XP)(XP)^T = X P P^T X^T = XX^T = B when P is an orthogonal rotation matrix
- from B to E we lose position information: replacing X(m) by W(m) = X(m) - μ gives the same E, even though W(m)^T W(l) = X(m)^T X(l) - X(m)^T μ - μ^T X(l) + μ^T μ differs from bm,l (checked in code below)
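A quick R check of both invariances on assumed toy data: rotating by an orthogonal P and shifting by μ leave E unchanged.

```r
set.seed(2)
X <- matrix(rnorm(12), nrow = 6, ncol = 2)
theta <- pi / 5
P <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)  # rotation
mu <- c(3, -1)                                                         # translation

W <- sweep(X %*% P, 2, mu, FUN = "+")   # rotate, then translate
max(abs(dist(W) - dist(X)))             # ~0: E keeps neither orientation nor position
```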
What is classical multidimensional scaling? What is the impact of the eigenvalues in CMDS?
- The method to obtain Y from E
- A test for Euclidean-ness (negative eigenvalues correspond to the non-Euclidean nature of data)
- A method for estimating dimensionality:
n’’ ≈ the number of eigenvalues clearly bigger than the rest
or n’’ such that sum (from i=1 to n’’) λi ≈ tr(B)
or reject every λi with |λi| <= |λn|, i.e. no bigger in magnitude than the most negative eigenvalue
How to run multidimensional scaling?
can use the R function cmdscale(E, k = maximum dimension, eig = TRUE), where the distance object E needs to be symmetrised first (example below)
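For example, with the built-in eurodist road distances (non-Euclidean, so some eigenvalues come out negative):

```r
mds <- cmdscale(eurodist, k = 2, eig = TRUE)
head(mds$points)                       # the fitted configuration Y
range(mds$eig)                         # negative eigenvalues flag non-Euclidean data
sum(mds$eig[1:2]) / sum(abs(mds$eig))  # rough share captured by 2 dimensions
```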
Give properties of a metric d(x,y)
- d(x,y) >= 0, with d(x,y) = 0 if x = y (the converse, x = y whenever d(x,y) = 0, does not always hold)
- symmetric: d(x,y) = d(y,x)
- triangle inequality: d(x,y) + d(y,z) >= d(x,z)
keep in mind: if {dα} is a family of metrics, then the sum Σα dα is again a metric
Define Hamming distance
The number of mismatches: d(x,y) = sum (from i=1 to n) di(x,y) = b + c, where di(x,y) = 1 if xi ≠ yi and 0 otherwise
Define Jaccard distance
dJ(x,y) = (b+c) / (a+b+c), where a = sum 1{xi = yi = 1}, b = sum 1{xi = 1, yi = 0}, c = sum 1{xi = 0, yi = 1} and d = sum 1{xi = yi = 0} (example below)
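Both counts-based distances take one line of R each; the binary vectors here are toy data:

```r
x <- c(1, 0, 1, 1, 0, 0)
y <- c(1, 1, 0, 1, 0, 1)

a <- sum(x == 1 & y == 1)   # joint presences
b <- sum(x == 1 & y == 0)
c <- sum(x == 0 & y == 1)
d <- sum(x == 0 & y == 0)   # joint absences

b + c                       # Hamming distance, equal to sum(x != y)
(b + c) / (a + b + c)       # Jaccard distance
```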
Give 5 examples of other dissimilarities/distances
- simple matching coefficient: (b+c) / (a+b+c+d)
- Manhattan distance: L1 norm, sum (from i=1 to p) |xi - yi|
- Canberra distance: sum (from i=1 to p) |xi - yi| / (|xi| + |yi|)
- Maximum: L∞ norm, max over i of |xi - yi| (see the dist() example below)
- Gower’s similarity coefficient: a weighted combination of variable-wise similarities, which can handle mixed continuous and categorical data
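The first three are built into R's dist(); Gower needs an extra package (assuming cluster is installed):

```r
x <- matrix(rnorm(20), nrow = 5)   # 5 observations, 4 variables

dist(x, method = "manhattan")  # sum_i |x_i - y_i|
dist(x, method = "canberra")   # sum_i |x_i - y_i| / (|x_i| + |y_i|)
dist(x, method = "maximum")    # max_i |x_i - y_i| (L-infinity)

# library(cluster); daisy(mixed_data, metric = "gower")  # Gower, mixed data types
```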
What is the stress function?
The stress function measures the degree of agreement between the dissimilarities {δm,l} and the Euclidean distances {dm,l} of a configuration X
. monotone (isotonic) regression gives fitted values {d̂m,l} in the same rank order as {δm,l}
. S(X) = sqrt(S*/T*), where S* = sum (over m<l) (dm,l - d̂m,l)^2 and T* = sum (over m<l) dm,l^2
What are the Miles algorithm and Young’s boundary search algorithm?
Algorithms for the monotone regression step; both produce a non-decreasing step function (each step value is the mean of the values in that “block”); see the isoreg() sketch below
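Neither algorithm is in base R, but isoreg() solves the same monotone-regression problem (pool-adjacent-violators); a toy sketch where out-of-order pairs get pooled:

```r
delta <- 1:6                      # dissimilarities, already in increasing order
d     <- c(1, 3, 2, 4, 6.5, 6)    # distances of the current configuration

fit <- isoreg(delta, d)           # pool-adjacent-violators
fit$yf                            # non-decreasing step function of block means
```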
How do we choose a configuration X from non-Euclidean dissimilarities?
- compute stress function
- minimise it over all possible configurations
How to find optimal configuration from stress function?
- Initial configuration: random, L-shaped, reduced from a higher dimension K to a lower one, or the classical scaling solution
- Optimisation using a gradient method with ∇S:
∂S/∂xi,k = (S/2) ((1/S*) (∂S*/∂xi,k) - (1/T*) (∂T*/∂xi,k))
with ∂T*/∂xi,k and ∂S*/∂xi,k given in the lecture notes, using ∂xm,k/∂xi,k = 1 if i = m and 0 otherwise; travel in the direction of -∇S to find the optimal X (see the sketch below)
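In practice this minimisation is available off the shelf; a sketch using MASS::isoMDS, which implements Kruskal's non-metric scaling and starts from the classical scaling solution by default:

```r
library(MASS)

fit <- isoMDS(eurodist, k = 2)   # iterates until the stress stops decreasing
fit$stress                       # final stress, reported as a percentage
head(fit$points)                 # the optimised configuration X
```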
Mention an advantage of ordinal scaling
It can cope with missing data
Steps of K-means clustering
- initialise with k centroids mj
- for each observation Xi, assign it to the closest centroid (by Euclidean distance), then update each centroid as the mean of its assigned observations
- stop when the cluster assignments no longer change (see the kmeans() example below)
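Base R's kmeans() implements exactly these steps; the two-cluster data here are toy:

```r
set.seed(3)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))

km <- kmeans(x, centers = 2)  # k = 2 centroids, random initialisation
km$centers                    # final centroids m_j
table(km$cluster)             # assignments after convergence
```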
What are self-organising maps (SOMs)?
Similar to k-means, but the prototypes lie on a grid: each observation is assigned to its closest prototype, and both that prototype and its grid neighbours are updated (see the sketch below)
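A sketch using the kohonen package (an assumption: it is not part of base R), with prototypes on a 3 x 3 grid:

```r
library(kohonen)   # install.packages("kohonen") if needed

set.seed(4)
x <- scale(matrix(rnorm(200), ncol = 4))   # 50 standardised observations

fit <- som(x, grid = somgrid(xdim = 3, ydim = 3, topo = "rectangular"))
fit$codes                                  # prototype ("codebook") vectors
```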