Exam Flashcards

1
Q

1) What formula do we optimize to get the hyperparameters in a GP?
2) What is the derivative of this formula with respect to the hyperparameters?

A

1) log p(y|X, theta) = -0.5 y^T K^-1 y - 0.5 log|K| - (N/2) log(2 pi),
K = k(X, X) + sigma^2 I
2) d/dtheta log p(y|X, theta) = 0.5 tr((alpha alpha^T - K^-1) dK/dtheta), alpha = K^-1 y
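A minimal numpy sketch of both formulas; it assumes the kernel matrix K and its derivative dK/dtheta are given, and the function names are illustrative:

```python
import numpy as np

def gp_log_marginal(K, y):
    """log p(y|X, theta) = -0.5 y^T K^-1 y - 0.5 log|K| - (N/2) log(2 pi)."""
    N = len(y)
    L = np.linalg.cholesky(K)                            # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K^-1 y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                 # = 0.5 log|K|
            - 0.5 * N * np.log(2 * np.pi))

def gp_log_marginal_grad(K, dK_dtheta, y):
    """d/dtheta log p(y|X, theta) = 0.5 tr((alpha alpha^T - K^-1) dK/dtheta)."""
    K_inv = np.linalg.inv(K)
    alpha = K_inv @ y
    return 0.5 * np.trace((np.outer(alpha, alpha) - K_inv) @ dK_dtheta)
```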

2
Q

What is the formula for

1) The radial basis kernel?
2) The exponential kernel?

A

Let z = |x - x'|; then:

1) s^2 exp(-z^2 / (2 l^2))
2) s^2 exp(-z / l)
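A minimal numpy sketch of the two kernels (function names are illustrative; the 1/(2 l^2) scaling follows the common convention):

```python
import numpy as np

def rbf_kernel(x, xp, s=1.0, l=1.0):
    z = np.abs(x - xp)
    return s**2 * np.exp(-z**2 / (2 * l**2))   # radial basis (squared exponential)

def exponential_kernel(x, xp, s=1.0, l=1.0):
    z = np.abs(x - xp)
    return s**2 * np.exp(-z / l)               # exponential (Ornstein-Uhlenbeck)
```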

3
Q

For GP, what is the complexity of:

1) Training (computational)?
2) Predicting (computational)?
3) The memory requirement?

A

1) O(N^3) (inverting the N x N kernel matrix)
2) O(N^2) (matrix-vector products with the inverse)
3) O(ND + N^2) (storing the data plus the kernel matrix)

4
Q

Why would we want to integrate out the hyperparameters in a GP, and how could we do it?

A

To avoid having to optimize the marginal likelihood; this optimization is especially problematic if large parts of the hyperparameter space have similar marginal likelihood values. We can do the integration using MCMC.

5
Q

What are the three most used acquisition functions for Bayesian optimization with GPs?

A

Let y(x) = (f(x_best) - mu(x)) / std(x). Then:

1) Probability of improvement:
a(x) = N_cdf(y(x))
2) Expected improvement:
a(x) = std(x) (y(x) N_cdf(y(x)) + N(y(x) | 0, 1))
3) GP lower confidence bound:
a(x) = -(mu(x) - k * std(x))
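A minimal numpy/scipy sketch of the three acquisition functions in their minimization form, matching the formulas above (names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, std, f_best):
    y = (f_best - mu) / std
    return norm.cdf(y)                            # a(x) = N_cdf(y(x))

def expected_improvement(mu, std, f_best):
    y = (f_best - mu) / std
    return std * (y * norm.cdf(y) + norm.pdf(y))

def lower_confidence_bound(mu, std, k=2.0):
    return -(mu - k * std)                        # low mean and high uncertainty score well
```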
6
Q

For a Markov chain, what does it mean, and how do we prove, that:

1) if we reach an equilibrium distribution, we stay in it?
2) it has only one equilibrium distribution?

A

1) By proving invariance. A sufficient condition is detailed balance:
p(x) T(x'|x) = p(x') T(x|x')
2) By proving ergodicity: any state can be reached from any state.

7
Q

What is Jensen's inequality?

A

For a concave function f, we have f(E[x]) >= E[f(x)]. In particular, log(E[x]) >= E[log(x)].

8
Q

How do we calculate mu and Cov in the Laplace approximation?

A
p(z) = (1/Z) p'(z), with p'(z) the unnormalized density
z* = mode of p'(z)
mu = z* (we can use the MAP estimate)
Cov^-1 = -d^2/dz^2 log p'(z) evaluated at z = z*
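A minimal sketch, assuming a generic unnormalized density supplied as -log p'(z); the mode is found numerically and the Hessian by finite differences (all names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approximation(neg_log_p, z0):
    """Fit N(mu, Cov) to an unnormalized density p'(z); neg_log_p(z) = -log p'(z)."""
    mu = minimize(neg_log_p, z0).x          # mu = z*, the mode of p'(z)
    # Hessian of -log p'(z) at the mode, by central finite differences
    d, eps = len(mu), 1e-5
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (neg_log_p(mu + ei + ej) - neg_log_p(mu + ei - ej)
                       - neg_log_p(mu - ei + ej) + neg_log_p(mu - ei - ej)) / (4 * eps**2)
    cov = np.linalg.inv(H)                  # Cov^-1 = -d^2/dz^2 log p'(z*)
    return mu, cov
```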
9
Q

What is the accept/reject rule in Metropolis-Hastings?

A

x' ~ q(x'|x_t)
u ~ U[0, 1]
Accept x' if (q(x_t|x') p(x')) / (q(x'|x_t) p(x_t)) >= u
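A minimal sketch using a symmetric Gaussian random-walk proposal, so the q terms cancel in the acceptance ratio (the step size and names are assumptions):

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric proposal."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        x_prop = x + step * rng.standard_normal(x.shape)       # x' ~ q(x'|x_t)
        if np.log(rng.uniform()) <= log_p(x_prop) - log_p(x):  # accept if ratio >= u
            x = x_prop
        samples.append(x.copy())
    return np.array(samples)
```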

10
Q

What are the formulas for the posterior mean m_post and covariance S_post in Gaussian processes?

A
m_post = m_x* + S_x*x (S_xx + sigma^2 I)^-1 (y - m_x)
S_post = S_x*x* - S_x*x (S_xx + sigma^2 I)^-1 S_xx*
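A minimal numpy sketch of the two formulas, assuming the kernel blocks are precomputed (names are illustrative):

```python
import numpy as np

def gp_posterior(K_xx, K_sx, K_ss, m_x, m_s, y, sigma):
    """Posterior mean/covariance at test points X*, given training data (X, y).

    K_xx = k(X, X), K_sx = k(X*, X), K_ss = k(X*, X*);
    m_x, m_s are the prior means at the training and test points.
    """
    A = K_xx + sigma**2 * np.eye(len(y))   # S_xx + sigma^2 I
    W = K_sx @ np.linalg.inv(A)            # S_x*x (S_xx + sigma^2 I)^-1
    m_post = m_s + W @ (y - m_x)
    S_post = K_ss - W @ K_sx.T             # S_xx* = K_sx.T
    return m_post, S_post
```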
11
Q

What is the formula for multiplying two Gaussians?

A
S_new = (S_1^-1 + S_2^-1)^-1
mu_new = S_new(S_1^-1 mu_1 + S_2^-1 mu_2)
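A minimal numpy sketch (names are illustrative; the product is an unnormalized Gaussian):

```python
import numpy as np

def multiply_gaussians(mu1, S1, mu2, S2):
    """Mean and covariance of the (unnormalized) product N(mu1, S1) N(mu2, S2)."""
    S_new = np.linalg.inv(np.linalg.inv(S1) + np.linalg.inv(S2))
    mu_new = S_new @ (np.linalg.solve(S1, mu1) + np.linalg.solve(S2, mu2))
    return mu_new, S_new
```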
12
Q

What could we do instead of maximizing the marginal likelihood when fitting GP hyperparameters?

A

Integrate out hyperparameters using MCMC

13
Q

What is the mean field approximation?

A

Approximate the posterior with a distribution q(B, z) in which all latent variables are independent, each governed by its own variational parameter.

14
Q

What is the optimal distribution in the coordinate ascent algorithm for mean field approximations?

A

log q*(z_j) = E_{-j}[log p(z_j, z_{-j}, X)] + const, where the expectation is over all factors except q(z_j)

15
Q

What is the algorithm for optimizing mean field?

A
  1. Input data
  2. Initialize global parameters Lambda at random
  3. While the ELBO has not converged:
    - For each datapoint x_i: update its local parameters
    - Update the global parameters
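A concrete closed-form instance of this loop, sketched for the classic Gaussian model x_i ~ N(mu, 1/tau) with a Normal-Gamma prior; the model, priors, and names are assumptions, not from the card:

```python
import numpy as np

def cavi_gaussian(x, mu0=0.0, lambda0=1.0, a0=1.0, b0=1.0, n_iters=50):
    """CAVI for x_i ~ N(mu, 1/tau), mu|tau ~ N(mu0, 1/(lambda0 tau)), tau ~ Gamma(a0, b0).

    Mean field q(mu, tau) = q(mu) q(tau); each update is the optimal
    log q*(z_j) = E_{-j}[log p(...)] in closed form.
    """
    N, xbar = len(x), np.mean(x)
    E_tau = a0 / b0                                    # initial guess for E[tau]
    mu_N = (lambda0 * mu0 + N * xbar) / (lambda0 + N)  # fixed across iterations
    for _ in range(n_iters):
        lambda_N = (lambda0 + N) * E_tau               # q(mu) = N(mu_N, 1/lambda_N)
        var_mu = 1.0 / lambda_N
        a_N = a0 + (N + 1) / 2                         # q(tau) = Gamma(a_N, b_N)
        b_N = b0 + 0.5 * (np.sum((x - mu_N) ** 2) + N * var_mu
                          + lambda0 * ((mu_N - mu0) ** 2 + var_mu))
        E_tau = a_N / b_N
    return mu_N, lambda_N, a_N, b_N
```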
16
Q

Name one challenge with mean field coordinate ascent?

A

It is inefficient: each update sweep has to look at the full dataset before the global parameters change.

17
Q

What is the natural gradient?

A

F(v)^-1 grad_v L(v), where F is the Fisher information matrix of q(.|v) and L is the objective (e.g. the ELBO).

It points in the direction of largest change in distribution space rather than parameter space, and is used to maximize the ELBO / minimize the KL divergence.
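A tiny illustration, assuming q = N(mu, sigma^2) with v = (mu, sigma), whose Fisher matrix is diag(1/sigma^2, 2/sigma^2); the objective gradient g is supplied by the caller:

```python
import numpy as np

def natural_gradient(grad, sigma):
    """F^-1 g for q = N(mu, sigma^2), v = (mu, sigma).

    Fisher information F = diag(1/sigma^2, 2/sigma^2), so F^-1 simply
    rescales each coordinate of the ordinary gradient g = (dL/dmu, dL/dsigma).
    """
    F_inv = np.diag([sigma**2, sigma**2 / 2.0])
    return F_inv @ grad
```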

18
Q

Why would we want to use the natural gradient?

A
  • Invariant to parametrization
  • Scales each parameter individually
  • Easy to compute for conditionally conjugate models
  • Unbiased
  • Only depends on parameters of a single datapoint, cheap to compute.
19
Q

What is the algorithm for stochastic VI optimization?

A
  1. Input model p(B, z, X)
  2. Initialize global parameters Lambda at random
  3. Repeat:
    - Sample a datapoint x_n uniformly
    - Update its local parameter phi_n
    - Update the global parameters by stepping a fraction toward the new optimal global estimate
20
Q

What is the ELBO?

A

ELBO(v) = E_q[log p(x, z) - log q(z; v)], a lower bound on log p(x)
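A minimal Monte Carlo estimate of the ELBO for a toy model p(z) = N(0, 1), p(x|z) = N(z, 1) with q(z) = N(m, s^2); the model and all names are assumptions for illustration:

```python
import numpy as np
from scipy.stats import norm

def elbo_estimate(x_obs, m, s, n_samples=10000, seed=0):
    """E_q[log p(x, z) - log q(z; v)] for p(z) = N(0, 1), p(x|z) = N(z, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(m, s, n_samples)                              # z ~ q(z; v)
    log_joint = norm.logpdf(z, 0, 1) + norm.logpdf(x_obs, z, 1)  # log p(x, z)
    return np.mean(log_joint - norm.logpdf(z, m, s))
```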

21
Q

What is usually the limitation of classical VI, and how can this be solved?

A

We can't calculate the expectation in the ELBO for non-conjugate models. This can be solved by:

  1. Lower bounding (model specific)
  2. Using BBVI
22
Q

Why do we want to switch the order of differentiation and expectation in BBVI?

A

If we can't solve the expectation analytically, we can estimate it with Monte Carlo sampling. But propagating the gradient through the sampling is not possible, so we want to differentiate before sampling.

23
Q

Describe the BBVI algorithm using score function.

A

1. Input model p(z, X) and VI approximation q(z|v)
2. Repeat:
- Draw S samples z_s from q
- Update v using the stochastic gradient:
(1/S) sum_s grad_v log q(z_s|v) g(z_s, v), where g(z, v) = log p(z, X) - log q(z|v)
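A minimal sketch for a Gaussian q(z|v) = N(m, s^2) with v = (m, s); log_joint must accept a vector of samples, and all names are illustrative:

```python
import numpy as np
from scipy.stats import norm

def score_function_gradient(log_joint, m, s, n_samples=1000, seed=0):
    """Score-function estimate of grad_v ELBO for q(z|v) = N(m, s^2), v = (m, s)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(m, s, n_samples)              # z_s ~ q(z|v)
    g = log_joint(z) - norm.logpdf(z, m, s)      # g(z, v) = log p(z, X) - log q(z|v)
    score_m = (z - m) / s**2                     # grad_m log q(z|v)
    score_s = ((z - m) ** 2 - s**2) / s**3       # grad_s log q(z|v)
    return np.array([np.mean(score_m * g), np.mean(score_s * g)])
```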

24
Q

In pathwise gradients, what is the reparametrization trick?

A

Let z = t(e, v), with e drawn from a simple, fixed base distribution (e.g. z = mu + sigma * e with e ~ N(0, 1)). This lets us take advantage of
grad_v E_q[f(z)] = E_p(e)[grad_v f(t(e, v))]

25
Q

What are the advantages/disadvantages of the pathwise gradient over the score function?

A

The advantage is that the variance is generally lower, so convergence is quicker. The disadvantages are that z has to be expressible as a (differentiable) transformation z = t(e, v) and that the model has to be differentiable.

26
Q

Describe the BBVI algorithm using pathwise gradient function.

A

1. Input model p(z, X) and VI approximation q(z|v)
2. Repeat:
- Draw samples e_s from the base distribution and set z_s = t(e_s, v)
- Update v using the stochastic gradient:
(1/S) sum_s grad_z g(z_s, v) grad_v t(e_s, v)
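A minimal sketch for q = N(m, s^2) with t(e, v) = m + s * e; it estimates only the model term of the ELBO gradient (the entropy term is omitted), and grad_log_joint and the names are assumptions:

```python
import numpy as np

def pathwise_gradient(grad_log_joint, m, s, n_samples=1000, seed=0):
    """Pathwise estimate of grad_v E_q[log p(z, X)] for q = N(m, s^2), v = (m, s)."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(n_samples)   # e ~ N(0, 1), the base distribution
    z = m + s * e                        # z = t(e, v)
    g = grad_log_joint(z)                # grad_z of the integrand at the samples
    return np.array([np.mean(g),         # chain rule: grad_m t(e, v) = 1
                     np.mean(g * e)])    # grad_s t(e, v) = e
```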

27
Q

What challenge does amortized inference address?

A

SVI has to optimize the local parameters phi_i separately for every datapoint. Amortized inference addresses this by learning a single mapping, with parameters theta, from a datapoint to its local parameters, so f(x_i; theta) = phi_i; f is usually a (deep) neural network.

28
Q

How can we get richer approximations than mean-field?

A

Use structured mean-field, for example:

  1. An autoregressive distribution, where z_i depends on all previous latent variables.
  2. Normalizing flows: send a simple distribution through a series of invertible transformations.
29
Q

What is the normalizing constant for the clique potentials in an MRF?

A

Z = sum_x prod_c psi_c(x_c)
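A brute-force sketch that enumerates all joint states of a tiny discrete MRF; the chain structure and the agreement potential in the example are assumptions:

```python
import itertools
import numpy as np

def partition_function(potentials, n_vars, n_states=2):
    """Z = sum_x prod_c psi_c(x_c), by enumeration (tiny models only).

    potentials: list of (clique, psi) pairs, where clique is a tuple of variable
    indices and psi maps the clique's state tuple to a non-negative number.
    """
    Z = 0.0
    for x in itertools.product(range(n_states), repeat=n_vars):
        Z += np.prod([psi(tuple(x[i] for i in clique)) for clique, psi in potentials])
    return Z

# Example: a 3-node chain with pairwise potentials favouring agreement
psi = lambda xc: 2.0 if xc[0] == xc[1] else 1.0
print(partition_function([((0, 1), psi), ((1, 2), psi)], n_vars=3))
```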