Block 4: Unsupervised Learning Flashcards

1
Q

What is the flaw of PCA?

A

PCA looks for the components with the largest variance (via eigendecomposition of the sample covariance matrix), but the directions of largest variance are not always the most interesting ones: we may be more interested in directions that reveal clusters.
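A minimal sketch of this point (not part of the card; the data and names are illustrative): two clusters are separated along a low-variance axis, so the first principal component picks the high-variance axis and misses the cluster structure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0, 10, size=n)                                   # high-variance but uninteresting axis
x2 = rng.normal(0, 1, size=n) + rng.choice([-3.0, 3.0], size=n)  # two clusters along x2
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)

S = X.T @ X / n                        # sample covariance
eigvals, eigvecs = np.linalg.eigh(S)   # PCA by eigendecomposition
first_pc = eigvecs[:, np.argmax(eigvals)]
print(np.round(first_pc, 2))           # ~ (±1, 0): points along x1 and misses the clusters
```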

2
Q

How does entropy measure “interestingness”, e.g. of a multimodal distribution (unlike a Gaussian)?

A
  • Kullback–Leibler divergence: DKL(g||f) = ∫ g(x) log{g(x)/f(x)} dx  (= 0 when g(x) = f(x) everywhere)
  • Entropy: H(g) = - ∫ g(x) log{g(x)} dx

Compare the entropies of densities with the same mean 0 and variance σ^2:
entropy is maximised by the Gaussian (= the most boring density), so the most interesting (optimal) g(.) minimises it.
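A hedged numerical illustration (an assumed example, not from the card): the differential entropy of a standard Gaussian versus a bimodal mixture scaled to the same mean 0 and variance 1, both estimated by integrating -∫ g log g on a grid.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]

def entropy(p):
    p = np.clip(p, 1e-300, None)          # avoid log(0) in the tails
    return -np.sum(p * np.log(p)) * dx    # numerical -∫ p log p dx

# Gaussian with mean 0, variance 1
h_gauss = entropy(norm.pdf(x, 0, 1))

# bimodal mixture with means ±0.9 and component sd chosen so its variance is also 1
mu = 0.9
s = np.sqrt(1 - mu**2)
h_mix = entropy(0.5 * norm.pdf(x, -mu, s) + 0.5 * norm.pdf(x, mu, s))

print(h_gauss)   # ≈ 0.5 * log(2·pi·e) ≈ 1.419: the Gaussian is the entropy maximiser
print(h_mix)     # smaller: the bimodal ("more interesting") density has lower entropy
```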

3
Q

Explain sphering

A

The aim is to make the sample variance matrix the identity I:

  • take R = S^(-1/2), where S is the sample variance of the centred data, S = 1/n X^T X
  • sphering step: W = XR, so that W has sample variance I
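A minimal sketch of this sphering step, assuming X is centred first and using the symmetric inverse square root S^(-1/2) obtained from the eigendecomposition of S (all names are illustrative):

```python
import numpy as np

def sphere(X):
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                    # centre first
    S = Xc.T @ Xc / n                          # sample variance matrix
    eigvals, V = np.linalg.eigh(S)             # S = V diag(eigvals) V^T
    R = V @ np.diag(eigvals ** -0.5) @ V.T     # symmetric inverse square root S^(-1/2)
    return Xc @ R                              # sphering step: W = X R

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3)) @ np.array([[2.0, 1.0, 0.0],
                                           [0.0, 1.0, 0.0],
                                           [0.0, 0.0, 3.0]])
W = sphere(X)
print(np.round(W.T @ W / X.shape[0], 3))       # ≈ identity: the variance matrix is now I
```
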
4
Q

Explain Exploratory Projection Pursuit (computational technique)

A
  • Centre and sphere the data matrix X to obtain W
  • Project onto a unit vector a: U_a = Wa
  • Form a density estimate f̂_{U,a} of the projection (using the density estimators of block 3)
  • Find the unit vector a minimising the entropy H{f̂_{U,a}}: start from a random unit vector a and optimise; repeat from several random starts, because the initial a can strongly affect the “optimal” result (see the sketch below)
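A hedged sketch of one pass of this recipe, assuming W is already centred and sphered; a Gaussian KDE (scipy's gaussian_kde) stands in for the block 3 density estimate and Nelder–Mead for the optimiser, both illustrative choices rather than the course's prescription.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gaussian_kde

def projection_entropy(a, W, grid):
    a = a / np.linalg.norm(a)                             # keep a on the unit sphere
    u = W @ a                                             # projected data U_a = W a
    f = np.clip(gaussian_kde(u)(grid), 1e-300, None)      # density estimate of the projection
    return -np.sum(f * np.log(f)) * (grid[1] - grid[0])   # numerical entropy H{f}

def pursue(W, restarts=5, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.linspace(-6, 6, 801)
    best = None
    for _ in range(restarts):              # several random starts: the initial a matters a lot
        a0 = rng.normal(size=W.shape[1])
        res = minimize(projection_entropy, a0 / np.linalg.norm(a0),
                       args=(W, grid), method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    return best.x / np.linalg.norm(best.x), best.fun
```
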
5
Q

Explain Independent Component Analysis

A

Using entropy minimisation: make the columns of the source matrix S statistically independent and non-Gaussian:

  • whiten X (centre and sphere) so that it has identity covariance
  • write X = SA^T, i.e. S = XA, for an orthogonal matrix A
  • minimise the mutual information I(Y) = sum(j=1 to p) H(Yj) - H(Y) over A, where Y = A^TX; for orthogonal A, H(Y) = H(X) does not depend on A, so this amounts to minimising sum(j=1 to p) H(Yj)
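As a hedged usage example (not the course's own algorithm): scikit-learn's FastICA recovers independent non-Gaussian components by maximising non-Gaussianity (negentropy), the practical counterpart of minimising sum(j) H(Yj); the sources and mixing matrix below are illustrative.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 2000
s1 = np.sign(np.sin(2 * np.pi * np.arange(n) / 50))    # square-wave source
s2 = rng.uniform(-1, 1, size=n)                        # uniform source
S_true = np.column_stack([s1, s2])                     # independent, non-Gaussian

A_mix = np.array([[1.0, 0.5],
                  [0.4, 1.0]])                         # illustrative mixing matrix
X = S_true @ A_mix.T                                   # observed mixtures

S_hat = FastICA(n_components=2, random_state=0).fit_transform(X)

# each estimated component should match one true source with |correlation| ≈ 1
corr = np.corrcoef(S_hat.T, S_true.T)[:2, 2:]
print(np.round(np.abs(corr), 2))
```
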
6
Q

Explain the flaw of the eigendecomposition (SVD) approach to ICA

A

Using the singular value decomposition X = UDV^T:
write X = SA^T with S = √n U and A^T = DV^T / √n. The columns of S are uncorrelated, but the decomposition is not unique: S can be rotated by any orthogonal matrix and still factorise X, so uncorrelatedness alone does not identify the components!
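A minimal sketch of this non-uniqueness on synthetic data: rotate S = √n U by an arbitrary orthogonal Q and check that the rotated columns are still uncorrelated and still reconstruct X exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)

U, D, Vt = np.linalg.svd(X, full_matrices=False)
S = np.sqrt(n) * U                           # candidate sources: uncorrelated columns
At = np.diag(D) @ Vt / np.sqrt(n)            # A^T, so that X = S A^T

theta = 0.7                                  # any rotation angle does the same job
Q = np.eye(p)
Q[:2, :2] = [[np.cos(theta), -np.sin(theta)],
             [np.sin(theta),  np.cos(theta)]]

S_rot = S @ Q                                # rotated "sources"
print(np.round(S_rot.T @ S_rot / n, 3))      # still ≈ identity: columns still uncorrelated
print(np.allclose(S_rot @ (Q.T @ At), X))    # True: still reconstructs X exactly
```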

7
Q

Express the Projection Pursuit Regression model and how it is fitted

A

f(X) = sum(m=1 to M) gm(wm^T X)

  • initialise the direction vectors {wm}
  • given the directions, estimate the ridge functions {gm} by smoothing splines
  • given the {gm}, update the {wm} by minimising the sum of squared errors via weighted least squares
  • repeat until convergence, cycling over the pairs {wm}, {gm} if M > 1
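A hedged sketch of a single-term (M = 1) fit following these steps, with scipy's UnivariateSpline standing in for the smoothing spline and a Gauss–Newton working response used for the weighted least-squares update of w; function and variable names are illustrative.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def fit_ppr_one_term(X, y, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)                             # initialise the direction
    for _ in range(n_iter):
        v = X @ w                                      # projections w^T x_i
        order = np.argsort(v)
        g = UnivariateSpline(v[order], y[order], k=3)  # smooth estimate of g given w
        gp = g.derivative()(v)                         # g'(w^T x_i)
        gp = np.where(np.abs(gp) < 1e-6, 1e-6, gp)     # guard against division by zero
        target = v + (y - g(v)) / gp                   # Gauss–Newton working response
        sw = np.abs(gp)                                # square root of the weights g'(.)^2
        w, *_ = np.linalg.lstsq(X * sw[:, None], sw * target, rcond=None)
        w /= np.linalg.norm(w)                         # keep w a unit direction
    return w, g
```
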
8
Q

Explain structure of neural networks

A
  • hidden units with a non-linear activation σ (e.g. the sigmoid): Zm = σ(α0m + αm^T X)
  • a single hidden layer Z = (Z1, ..., ZM)
  • linear output function to reach the target: Y = β0 + β^T Z
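A minimal sketch of this forward structure in plain numpy; the sizes and weights are illustrative, and the biases are written out explicitly.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(X, alpha0, alpha, beta0, beta):
    Z = sigmoid(alpha0 + X @ alpha)    # hidden layer: Z_m = σ(α_0m + α_m^T X), non-linear
    return beta0 + Z @ beta            # output: Y = β_0 + β^T Z, linear

rng = np.random.default_rng(0)
p, M = 4, 3                            # number of inputs and of hidden units
X = rng.normal(size=(10, p))
alpha0, alpha = rng.normal(size=M), rng.normal(size=(p, M))
beta0, beta = 0.0, rng.normal(size=M)
print(forward(X, alpha0, alpha, beta0, beta).shape)   # (10,): one prediction per row of X
```
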
9
Q

Compare PPR and ANN

A
  • PPR uses flexible non-parametric ridge functions {gm}, whereas the ANN’s sigmoids σ are less flexible but fully parametric, with parameters {αk} and {βk} called weights
  • the ANN also minimises the sum of squared errors, using gradient descent that alternates back-propagation (update the parameters using the gradient of the model and a learning rate) and forward propagation (predict and compute the sum-of-squares error with the current parameters)
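A hedged sketch of one such alternation for the single-hidden-layer network of the previous card, in plain numpy with an illustrative learning rate; biases are omitted for brevity.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_step(X, y, alpha, beta, lr=0.01):
    # forward propagation: predict and compute the sum-of-squares error
    Z = sigmoid(X @ alpha)                         # hidden layer (biases omitted)
    residual = Z @ beta - y                        # y_hat - y
    sse = 0.5 * np.sum(residual ** 2)
    # back propagation: chain-rule gradients, then a gradient-descent update
    grad_beta = Z.T @ residual
    grad_alpha = X.T @ (np.outer(residual, beta) * Z * (1.0 - Z))
    return alpha - lr * grad_alpha, beta - lr * grad_beta, sse
```
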
10
Q

What are the disadvantages of ANN?

A
  • tends to overfit when there are too many parameters
  • the fitted solution is very sensitive to the inputs (e.g. their scaling)
  • hard to choose the number of hidden units and layers