Block 4: Unsupervised Learning Flashcards
What is the flaw of PCA?
PCA looks for the components with the largest variance (via an eigendecomposition), but the largest-variance directions are not always the most interesting; we may care more about structure such as clusters.
How does entropy measure the “interestingness” of a density (e.g. a multimodal density, unlike a Gaussian)?
- Kullback–Leibler divergence: DKL(g||f) = ∫ g(x) log{g(x)/f(x)} dx (= 0 when g(x) = f(x))
- Entropy: H(g) = - ∫g(x) log{g(x)} dx
Compare entropy of densities with same variance σ^2 and mean 0.
Entropy is maximised by the Gaussian (= the most boring density) and minimised by the optimal, most interesting g(·); so we look for low-entropy projections.
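A quick numerical check (a sketch: the grid and the variance-matched bimodal mixture below are illustrative choices, not from the notes) shows the Gaussian has the largest entropy among unit-variance densities:

```python
import numpy as np

def gaussian_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

x = np.linspace(-8, 8, 20001)
dx = x[1] - x[0]

# Two densities with mean 0 and variance 1:
g_gauss = gaussian_pdf(x, 0.0, 1.0)
# bimodal mixture 0.5 N(-0.8, 0.6^2) + 0.5 N(0.8, 0.6^2): variance = 0.8^2 + 0.6^2 = 1
g_mix = 0.5 * gaussian_pdf(x, -0.8, 0.6) + 0.5 * gaussian_pdf(x, 0.8, 0.6)

def entropy(g):
    # Riemann-sum approximation of H(g) = -∫ g(x) log g(x) dx
    safe = np.where(g > 0, g, 1.0)
    return -np.sum(g * np.log(safe)) * dx

print("H(Gaussian) ≈", entropy(g_gauss))  # ≈ 0.5 log(2πe) ≈ 1.419, the maximum
print("H(bimodal)  ≈", entropy(g_mix))    # smaller: the bimodal density is more interesting
```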
Explain sphering
The aim is to transform X so that its sample variance matrix is the identity I:
- take R = S^(-1/2), where S is the sample variance S = (1/n) X^T X (with X centred)
- sphering step: W = XR, so that (1/n) W^T W = I (see the sketch below)
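A minimal numpy sketch of the sphering step (the toy data and variable names are my own; X must be centred before forming S):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.3, 0.5]])  # correlated toy data

X = X - X.mean(axis=0)                # centre
S = X.T @ X / X.shape[0]              # sample variance S = (1/n) X^T X

# R = S^(-1/2) via the eigendecomposition S = V diag(lam) V^T
lam, V = np.linalg.eigh(S)
R = V @ np.diag(lam ** -0.5) @ V.T

W = X @ R                             # sphering step
print(np.round(W.T @ W / W.shape[0], 6))  # ≈ identity matrix
```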
Explain Exploratory Projection Pursuit (computational technique)
- Centre and sphere the data matrix X to obtain W
- Project onto a unit vector a: u_a = Wa
- Form a density estimate f̂_{u,a} of the projected data (using the methods of block 3)
- Find the unit vector a minimising the entropy H{f̂_{u,a}}: start from a random unit vector a and optimise numerically, then repeat from several random starts because the initial a can strongly influence the “optimal” result (see the sketch below)
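A rough sketch of this search (assumptions: the sphered matrix W comes from the previous step, a Gaussian KDE plays the role of the block-3 density estimate, and the entropy is approximated by the average negative log-density at the sample points):

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.optimize import minimize

def projected_entropy(a, W):
    a = a / np.linalg.norm(a)           # keep a a unit vector
    u = W @ a                           # projected data u_a = W a
    kde = gaussian_kde(u)               # density estimate of the projection
    return -np.mean(np.log(kde(u)))     # ≈ entropy of the projected density

def projection_pursuit(W, n_starts=10, seed=0):
    rng = np.random.default_rng(seed)
    best_a, best_h = None, np.inf
    for _ in range(n_starts):           # several random starts: the result depends on the initial a
        a0 = rng.normal(size=W.shape[1])
        res = minimize(projected_entropy, a0, args=(W,), method="Nelder-Mead")
        if res.fun < best_h:
            best_h = res.fun
            best_a = res.x / np.linalg.norm(res.x)
    return best_a, best_h

# a_star, h_star = projection_pursuit(W)   # W: centred and sphered data matrix
```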
Explain Independent Component Analysis
Using entropy minimisation: make the columns of S statistically independent and non-Gaussian:
- whitening (sphere X) so that the covariance is the identity
- write S = XA for an orthogonal A
- for Y = A^T X, minimise the mutual information I(Y) = sum(j=1 to p) H(Yj) - H(X) over A; since H(X) does not depend on A, this means minimising sum(j=1 to p) H(Yj) (see the illustration below)
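For illustration only, scikit-learn's FastICA solves a closely related problem (it maximises non-Gaussianity through a negentropy approximation rather than minimising sum H(Yj) directly); the toy sources and mixing matrix below are made up:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two independent, non-Gaussian sources.
s1 = np.sign(np.sin(3 * t))             # square wave
s2 = rng.laplace(size=t.size)           # heavy-tailed noise
S_true = np.c_[s1, s2]

A_mix = np.array([[1.0, 0.5],
                  [0.4, 1.0]])          # "unknown" mixing matrix
X = S_true @ A_mix.T                    # observed mixtures

ica = FastICA(n_components=2, whiten="unit-variance", random_state=0)
S_hat = ica.fit_transform(X)            # estimated sources (up to order, sign and scale)
```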
Explain flaw of eigendecomposition for ICA
Using the SVD X = UDV^T, write X = SA^T with S = √n U and A^T = DV^T/√n: the columns of S are uncorrelated with unit variance, but the decomposition is not unique, because S can be rotated: for any orthogonal R, X = (SR)(R^T A^T) as well. Second moments alone cannot identify the rotation (see the check below).
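The non-uniqueness is easy to check numerically (a small numpy sketch): rotating S by any orthogonal R and counter-rotating A^T reproduces the same X with columns that are still uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
n = X.shape[0]

U, D, Vt = np.linalg.svd(X, full_matrices=False)
S = np.sqrt(n) * U                       # uncorrelated columns, unit variance
At = np.diag(D) @ Vt / np.sqrt(n)        # A^T
print(np.allclose(X, S @ At))            # True: X = S A^T

R, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
print(np.allclose(X, (S @ R) @ (R.T @ At)))    # still True: same X
print(np.round((S @ R).T @ (S @ R) / n, 6))    # ≈ identity: still uncorrelated
```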
Express the Projection Pursuit Regression
f(X) = sum(m=1 to M) gm(wm^T X)
- initialise the direction vectors {wm}
- given the directions, estimate the ridge functions {gm} by smoothing (e.g. smoothing splines)
- given {gm}, update {wm} by minimising the sum-of-squares error via weighted least squares (a Gauss–Newton step)
- repeat until convergence, cycling over the {wm} and {gm} when M > 1 (see the single-term sketch below)
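A rough single-term (M = 1) sketch of this alternating fit; the toy data, the choice of UnivariateSpline as the smoother, and its smoothing parameter are illustrative assumptions, and the direction update is the usual Gauss–Newton / weighted-least-squares step:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)

# Toy data: y depends on X only through one ridge direction.
n, p = 500, 3
X = rng.normal(size=(n, p))
w_true = np.array([0.8, 0.6, 0.0])
y = np.sin(2 * X @ w_true) + 0.1 * rng.normal(size=n)

w = rng.normal(size=p)
w /= np.linalg.norm(w)                              # initial direction

for _ in range(20):
    v = X @ w                                       # projections w^T x_i
    order = np.argsort(v)
    g = UnivariateSpline(v[order], y[order], k=3, s=n * 0.05)   # smooth g given w
    # Gauss-Newton step for w: linearise g(w^T x) around the current w,
    # giving a weighted least-squares problem with weights g'(w^T x_i)^2.
    gp = g.derivative()(v)
    gp = np.where(np.abs(gp) < 1e-8, 1e-8, gp)
    target = v + (y - g(v)) / gp                    # working response
    wt = gp ** 2                                    # working weights
    w_new = np.linalg.solve(X.T @ (X * wt[:, None]), X.T @ (wt * target))
    w_new /= np.linalg.norm(w_new)
    if abs(w_new @ w) > 1 - 1e-8:                   # direction has stopped changing
        w = w_new
        break
    w = w_new

print(np.round(w, 3))                               # ideally ≈ ±w_true for this toy example
```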
Explain structure of neural networks
- hidden units: Zm = σ(α0m + αm^T X), where the activation σ is non-linear (e.g. the sigmoid)
- one hidden layer Z = (Z1, ..., ZM)
- output function to reach the target: Y = β0 + β^T Z (linear in Z)
Compare PPR and ANN
- PPR uses non-parametric ridge functions {gm}, whereas the sigmoid activations {σ} of an ANN are less flexible but introduce parameters {αk} and {βk} called weights
- an ANN also minimises a sum-of-squares error, but by gradient descent, alternating forward propagation (predict and compute the sum-of-squares error with the current weights) and back-propagation (update the weights using the gradient of the model and a learning rate); see the sketch below
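A minimal numpy sketch of these alternating passes for one hidden layer and squared-error loss (layer sizes, learning rate and toy data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Toy regression data.
n, p, M = 200, 3, 10                    # n observations, p inputs, M hidden units
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# Weights: {alpha} for the hidden layer, {beta} for the linear output.
alpha0, alpha = np.zeros(M), 0.1 * rng.normal(size=(p, M))
beta0, beta = 0.0, 0.1 * rng.normal(size=M)
lr = 0.05                               # learning rate

for epoch in range(2000):
    # Forward propagation: predict with the current weights.
    Z = sigmoid(alpha0 + X @ alpha)     # hidden layer, n x M
    yhat = beta0 + Z @ beta             # linear output
    resid = yhat - y                    # residuals of the sum-of-squares error

    # Back propagation: gradients of 0.5 * sum(resid^2) w.r.t. the weights.
    g_beta0 = resid.sum()
    g_beta = Z.T @ resid
    delta = np.outer(resid, beta) * Z * (1 - Z)   # chain rule through the sigmoid
    g_alpha0 = delta.sum(axis=0)
    g_alpha = X.T @ delta

    # Gradient-descent update with learning rate lr.
    beta0 -= lr * g_beta0 / n
    beta -= lr * g_beta / n
    alpha0 -= lr * g_alpha0 / n
    alpha -= lr * g_alpha / n

print(np.mean(resid ** 2))              # training error after the final epoch
```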
What are the disadvantages of ANN?
- ANNs tend to overfit when they have too many parameters (weights)
- the solution is very sensitive to the inputs (e.g. their scaling)
- it is hard to choose the number of hidden layers/units