Exam Flashcards
1) What formula do we optimize to get the hyperparameters in a GP?
2) What is the derivative of this formula with respect to the hyperparameters?
1) log p(y|X, theta) = -0.5 y^T K^-1 y - 0.5 log|K| - (N/2) log(2 pi)
K = k(X, X) + sigma^2 I
2) d/dtheta log p(y|X, theta) = 0.5 tr((a a^T - K^-1) dK/dtheta), with a = K^-1 y
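A minimal NumPy sketch of both quantities, assuming an RBF kernel k(x, x') = s^2 exp(-|x - x'|^2 / l^2) and showing only the gradient for the length-scale l (function and argument names are illustrative):

    import numpy as np

    def log_marglik_and_grad(X, y, s, l, sigma):
        # X: (N, D) inputs, y: (N,) targets; K = k(X, X) + sigma^2 I
        d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        Kf = s**2 * np.exp(-d2 / l**2)
        K = Kf + sigma**2 * np.eye(len(y))
        K_inv = np.linalg.inv(K)
        a = K_inv @ y                                   # a = K^-1 y
        logml = (-0.5 * y @ a
                 - 0.5 * np.linalg.slogdet(K)[1]
                 - 0.5 * len(y) * np.log(2 * np.pi))
        dK_dl = Kf * 2 * d2 / l**3                      # dK/dl for the length-scale
        grad_l = 0.5 * np.trace((np.outer(a, a) - K_inv) @ dK_dl)
        return logml, grad_l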
What is the formula for:
1) The radial basis kernel?
2) The exponential kernel?
Let z = |x - x'|, then:
1) s^2 exp(-z^2 / l^2)
2) s^2 exp(-z / l)
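A small sketch of both kernels for scalar inputs (s = signal std, l = length-scale; note some texts put a factor 1/2 in the RBF exponent):

    import numpy as np

    def rbf_kernel(x, x2, s, l):
        z = np.abs(x - x2)
        return s**2 * np.exp(-z**2 / l**2)

    def exponential_kernel(x, x2, s, l):
        z = np.abs(x - x2)
        return s**2 * np.exp(-z / l)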
For GP, what is the complexity of:
1) Training (computational)?
2) Predicting (computational)?
3) The memory requirement?
1) O(N^3) (Matrix inversion)
2) O(N^2) (Matrix multiplication)
3) O(ND + 2N^2)
Why would we want to integrate out the hyperparameters in a GP, and how could we do it?
To avoid having to optimize them; optimization is especially problematic if large parts of the hyperparameter space have similar marginal likelihood values. We can integrate them out using MCMC.
What are the three most used acquisition functions for Bayesian optimization with GPs?
Let gamma(x) = (f(x_best) - mu(x)) / std(x). Then:
1) Probability of improvement: a(x) = N_cdf(gamma(x))
2) Expected improvement: a(x) = std(x) * (gamma(x) * N_cdf(gamma(x)) + N(gamma(x) | 0, 1))
3) GP lower confidence bound: a(x) = -(mu(x) - k * std(x))
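A hedged sketch of the three acquisition functions using scipy.stats.norm (mu, std are the GP posterior mean and std at x; kappa is an assumed trade-off parameter):

    import numpy as np
    from scipy.stats import norm

    def acquisitions(mu, std, f_best, kappa=2.0):
        gamma = (f_best - mu) / std
        pi  = norm.cdf(gamma)                                    # probability of improvement
        ei  = std * (gamma * norm.cdf(gamma) + norm.pdf(gamma))  # expected improvement
        lcb = -(mu - kappa * std)                                # GP lower confidence bound
        return pi, ei, lcb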
What does it mean, and how do we show, that a Markov chain:
1) stays in an equilibrium distribution once it reaches it?
2) has only one equilibrium distribution?
1) By proving invariance. A sufficient condition is detailed balance:
p(x) T(x'|x) = p(x') T(x|x')
2) By proving ergodicity: any state can be reached from any other state.
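A tiny numerical check of detailed balance and invariance for a hypothetical 2-state chain (numbers chosen only for illustration):

    import numpy as np

    p = np.array([0.25, 0.75])            # candidate equilibrium distribution
    T = np.array([[0.7, 0.3],             # T[i, j] = P(move from state i to state j)
                  [0.1, 0.9]])
    flow = p[:, None] * T                 # flow[i, j] = p(x_i) T(x_j | x_i)
    assert np.allclose(flow, flow.T)      # detailed balance holds
    assert np.allclose(p @ T, p)          # hence p is invariant under one step of the chain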
What is Jensen's inequality?
For a concave function f, we have f(E[x]) >= E[f(x)]. In particular, log(E[x]) >= E[log(x)].
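A quick numerical illustration of the log special case (sample names are illustrative):

    import numpy as np

    x = np.random.rand(100_000) + 0.1             # strictly positive samples
    assert np.log(x.mean()) >= np.log(x).mean()   # log(E[x]) >= E[log(x)]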
How do we calculate mu and Cov in the Laplace approximation?
p(z) = (1/Z) * p'(z), with p'(z) the unnormalized density.
z* = mode of p'(z); set mu = z* (the MAP estimate can be used).
Cov^-1 = -d^2/dz^2 log p'(z*)
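A 1-D sketch using finite differences (the gradient-ascent mode search and step sizes are purely illustrative, not the course's method):

    import numpy as np

    def laplace_1d(log_p_tilde, z0, h=1e-4, iters=500, lr=1e-2):
        # locate the mode z* of the unnormalized log density by gradient ascent
        z = z0
        for _ in range(iters):
            grad = (log_p_tilde(z + h) - log_p_tilde(z - h)) / (2 * h)
            z = z + lr * grad
        # Cov^-1 = -d^2/dz^2 log p'(z*), estimated by finite differences
        curv = (log_p_tilde(z + h) - 2 * log_p_tilde(z) + log_p_tilde(z - h)) / h**2
        return z, -1.0 / curv             # mu = z*, Cov = (-curvature)^-1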
What is the formula for the Metropolis-Hastings acceptance step?
x' ~ q(x'|x_t)
u ~ U[0, 1]
Accept x' if (q(x_t|x') p(x')) / (q(x'|x_t) p(x_t)) >= u
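A minimal random-walk sketch; with a symmetric Gaussian proposal the q-terms cancel, so only the target ratio p(x')/p(x_t) remains (names and defaults are illustrative):

    import numpy as np

    def metropolis_hastings(log_p, x0, n_steps=10_000, prop_std=1.0):
        samples, x = [], x0
        for _ in range(n_steps):
            x_new = x + prop_std * np.random.randn()      # x' ~ q(x'|x_t), symmetric
            # accept if u <= p(x') / p(x_t); compared in log space for stability
            if np.log(np.random.rand()) < log_p(x_new) - log_p(x):
                x = x_new
            samples.append(x)
        return np.array(samples)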
What is the formula for calculating the m_post and S_post in Gaussian processes?
m_post = m_x* + S_x*x (S_xx + sigma^2 I)^-1 (y - m_x)
S_post = S_x*x* - S_x*x (S_xx + sigma^2 I)^-1 S_xx*
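A direct NumPy translation (K_xx, K_sx, K_ss stand for the train-train, test-train, and test-test covariances; naive inversion kept for clarity):

    import numpy as np

    def gp_posterior(K_xx, K_sx, K_ss, y, m_x, m_s, sigma):
        A = K_sx @ np.linalg.inv(K_xx + sigma**2 * np.eye(len(y)))
        m_post = m_s + A @ (y - m_x)     # m_x* + S_x*x (S_xx + sigma^2 I)^-1 (y - m_x)
        S_post = K_ss - A @ K_sx.T       # S_x*x* - S_x*x (S_xx + sigma^2 I)^-1 S_xx*
        return m_post, S_post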
What is the formula for multiplying two Gaussians?
S_new = (S_1^-1 + S_2^-1)^-1
mu_new = S_new (S_1^-1 mu_1 + S_2^-1 mu_2)
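The same precision-weighted combination as code (valid up to a normalization constant; expects covariance matrices):

    import numpy as np

    def multiply_gaussians(mu1, S1, mu2, S2):
        S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)
        S_new = np.linalg.inv(S1_inv + S2_inv)
        mu_new = S_new @ (S1_inv @ mu1 + S2_inv @ mu2)
        return mu_new, S_new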
What could we do instead of maximizing the marginal likelihood for GP hyperparameter optimization?
Integrate out hyperparameters using MCMC
What is the mean field approximation?
Use a distribution q(B, l) in which all latent variables are independent and governed by their own variational parameters.
What is the optimal distribution in the coordinate ascent algorithm for mean field approximations?
log q*(z_j) = E_{-j}[log p(z_j, z_{-j}, X)] + const, where the expectation is over all latent variables except z_j.
What is the algorithm for optimizing mean field?
- Input: data
- Initialize the global parameters Lambda at random
- While the ELBO has not converged:
  - For each data point x_i:
    - Update the local parameters
  - Update the global parameters
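A generic skeleton of this loop (update_local, update_global, and elbo are hypothetical placeholders for the model-specific updates derived from log q*(z_j)):

    def mean_field_cavi(data, init_global, update_local, update_global, elbo, tol=1e-6):
        lam = init_global()                                # global variational parameters
        old_elbo = -float("inf")
        while True:
            local = [update_local(x, lam) for x in data]   # per-datapoint parameters
            lam = update_global(data, local)
            new_elbo = elbo(data, local, lam)
            if abs(new_elbo - old_elbo) < tol:             # ELBO converged
                return lam, local
            old_elbo = new_elbo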