Exam Flashcards

1
Q

1) What formula do we optimize to get the hyperparameters in a GP?
2) What is the derivative of this formula with respect to the hyperparameters?

A

1) log p(y|X, theta) = -0.5 y^T K^-1 y - 0.5 log|K| - (N/2) log(2 pi),
K = k(X, X) + sigma^2 I
2) d/dtheta log p(y|X, theta) = 0.5 tr((alpha alpha^T - K^-1) dK/dtheta), alpha = K^-1 y
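A minimal numpy sketch of both formulas; it assumes the kernel matrix K and its derivative dK/dtheta are given, and the function names are illustrative:

```python
import numpy as np

def gp_log_marginal(K, y):
    """log p(y|X, theta) = -0.5 y^T K^-1 y - 0.5 log|K| - (N/2) log(2 pi)."""
    N = len(y)
    L = np.linalg.cholesky(K)                            # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K^-1 y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                 # = 0.5 log|K|
            - 0.5 * N * np.log(2 * np.pi))

def gp_log_marginal_grad(K, dK_dtheta, y):
    """d/dtheta log p(y|X, theta) = 0.5 tr((alpha alpha^T - K^-1) dK/dtheta)."""
    K_inv = np.linalg.inv(K)
    alpha = K_inv @ y
    return 0.5 * np.trace((np.outer(alpha, alpha) - K_inv) @ dK_dtheta)
```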

2
Q

What is the formula for

1) The radial basis kernel?
2) The exponential kernel?

A

Let z = |x - x'|; then:

1) s^2 exp(-z^2 / (2 l^2))
2) s^2 exp(-z / l)
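A minimal numpy sketch of the two kernels (function names are illustrative; the 1/(2 l^2) scaling follows the common convention):

```python
import numpy as np

def rbf_kernel(x, xp, s=1.0, l=1.0):
    z = np.abs(x - xp)
    return s**2 * np.exp(-z**2 / (2 * l**2))   # radial basis (squared exponential)

def exponential_kernel(x, xp, s=1.0, l=1.0):
    z = np.abs(x - xp)
    return s**2 * np.exp(-z / l)               # exponential (Ornstein-Uhlenbeck)
```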

3
Q

For GP, what is the complexity of:

1) Training (computational)?
2) Predicting (computational)?
3) The memory requirement?

A

1) O(N^3) (inverting the N x N kernel matrix)
2) O(N^2) (matrix-vector products with the inverse)
3) O(ND + N^2) (storing the data plus the kernel matrix)

4
Q

Why would we want to integrate out the hyperparameters in a GP, and how could we do it?

A

To avoid having to optimize the marginal likelihood; this optimization is especially problematic if large parts of the hyperparameter space have similar marginal likelihood values. We can do the integration using MCMC.

5
Q

What are the three most used acquisition functions for Bayesian optimization with GPs?

A

Let y(x) = (f(x_best) - mu(x)) / std(x). Then:

1) Probability of improvement:
a(x) = N_cdf(y(x))
2) Expected improvement:
a(x) = std(x) (y(x) N_cdf(y(x)) + N(y(x) | 0, 1))
3) GP lower confidence bound:
a(x) = -(mu(x) - k * std(x))
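A minimal numpy/scipy sketch of the three acquisition functions in their minimization form, matching the formulas above (names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, std, f_best):
    y = (f_best - mu) / std
    return norm.cdf(y)                            # a(x) = N_cdf(y(x))

def expected_improvement(mu, std, f_best):
    y = (f_best - mu) / std
    return std * (y * norm.cdf(y) + norm.pdf(y))

def lower_confidence_bound(mu, std, k=2.0):
    return -(mu - k * std)                        # low mean and high uncertainty score well
```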
6
Q

For a Markov chain, what does it mean, and how do we prove, that:

1) if we reach an equilibrium distribution, we stay in it?
2) it has only one equilibrium distribution?

A

1) By proving invariance. A sufficient condition is detailed balance:
p(x) T(x'|x) = p(x') T(x|x')
2) By proving ergodicity: any state can be reached from any state.

7
Q

What is Jensen's inequality?

A

For a concave function f, we have f(E[x]) >= E[f(x)]. In particular, log(E[x]) >= E[log(x)].

8
Q

How do we calculate mu and Cov in the Laplace approximation?

A
p(z) = (1/Z) p'(z), with p'(z) the unnormalized density
z* = mode of p'(z)
mu = z* (we can use the MAP estimate)
Cov^-1 = -d^2/dz^2 log p'(z) evaluated at z = z*
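A minimal sketch, assuming a generic unnormalized density supplied as -log p'(z); the mode is found numerically and the Hessian by finite differences (all names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approximation(neg_log_p, z0):
    """Fit N(mu, Cov) to an unnormalized density p'(z); neg_log_p(z) = -log p'(z)."""
    mu = minimize(neg_log_p, z0).x          # mu = z*, the mode of p'(z)
    # Hessian of -log p'(z) at the mode, by central finite differences
    d, eps = len(mu), 1e-5
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (neg_log_p(mu + ei + ej) - neg_log_p(mu + ei - ej)
                       - neg_log_p(mu - ei + ej) + neg_log_p(mu - ei - ej)) / (4 * eps**2)
    cov = np.linalg.inv(H)                  # Cov^-1 = -d^2/dz^2 log p'(z*)
    return mu, cov
```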
9
Q

What is the accept/reject rule in Metropolis-Hastings?

A

x' ~ q(x'|x_t)
u ~ U[0, 1]
Accept x' if (q(x_t|x') p(x')) / (q(x'|x_t) p(x_t)) >= u
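A minimal sketch using a symmetric Gaussian random-walk proposal, so the q terms cancel in the acceptance ratio (the step size and names are assumptions):

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric proposal."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        x_prop = x + step * rng.standard_normal(x.shape)       # x' ~ q(x'|x_t)
        if np.log(rng.uniform()) <= log_p(x_prop) - log_p(x):  # accept if ratio >= u
            x = x_prop
        samples.append(x.copy())
    return np.array(samples)
```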

10
Q

What are the formulas for the posterior mean m_post and covariance S_post in Gaussian processes?

A
m_post = m_x* + S_x*x (S_xx + sigma^2 I)^-1 (y - m_x)
S_post = S_x*x* - S_x*x (S_xx + sigma^2 I)^-1 S_xx*
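A minimal numpy sketch of the two formulas, assuming the kernel blocks are precomputed (names are illustrative):

```python
import numpy as np

def gp_posterior(K_xx, K_sx, K_ss, m_x, m_s, y, sigma):
    """Posterior mean/covariance at test points X*, given training data (X, y).

    K_xx = k(X, X), K_sx = k(X*, X), K_ss = k(X*, X*);
    m_x, m_s are the prior means at the training and test points.
    """
    A = K_xx + sigma**2 * np.eye(len(y))   # S_xx + sigma^2 I
    W = K_sx @ np.linalg.inv(A)            # S_x*x (S_xx + sigma^2 I)^-1
    m_post = m_s + W @ (y - m_x)
    S_post = K_ss - W @ K_sx.T             # S_xx* = K_sx.T
    return m_post, S_post
```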
11
Q

What is the formula for multiplying two Gaussians?

A
S_new = (S_1^-1 + S_2^-1)^-1
mu_new = S_new(S_1^-1 mu_1 + S_2^-1 mu_2)
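A minimal numpy sketch (names are illustrative; the product is an unnormalized Gaussian):

```python
import numpy as np

def multiply_gaussians(mu1, S1, mu2, S2):
    """Mean and covariance of the (unnormalized) product N(mu1, S1) N(mu2, S2)."""
    S_new = np.linalg.inv(np.linalg.inv(S1) + np.linalg.inv(S2))
    mu_new = S_new @ (np.linalg.solve(S1, mu1) + np.linalg.solve(S2, mu2))
    return mu_new, S_new
```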
12
Q

What could we do instead of maximizing the marginal likelihood when fitting GP hyperparameters?

A

Integrate out hyperparameters using MCMC

13
Q

What is the mean field approximation?

A

Approximate the posterior with a distribution q(B, z) in which all latent variables are independent, each governed by its own variational parameter.

14
Q

What is the optimal distribution in the coordinate ascent algorithm for mean field approximations?

A

log q*(z_j) = E_{-j}[log p(z_j, z_{-j}, X)] + const, where the expectation is over all factors except q(z_j)

15
Q

What is the algorithm for optimizing mean field?

A
  1. Input data
  2. Initialize global parameters Lambda at random
  3. While the ELBO has not converged:
    - For each datapoint x_i: update its local parameters
    - Update the global parameters
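A concrete closed-form instance of this loop, sketched for the classic Gaussian model x_i ~ N(mu, 1/tau) with a Normal-Gamma prior; the model, priors, and names are assumptions, not from the card:

```python
import numpy as np

def cavi_gaussian(x, mu0=0.0, lambda0=1.0, a0=1.0, b0=1.0, n_iters=50):
    """CAVI for x_i ~ N(mu, 1/tau), mu|tau ~ N(mu0, 1/(lambda0 tau)), tau ~ Gamma(a0, b0).

    Mean field q(mu, tau) = q(mu) q(tau); each update is the optimal
    log q*(z_j) = E_{-j}[log p(...)] in closed form.
    """
    N, xbar = len(x), np.mean(x)
    E_tau = a0 / b0                                    # initial guess for E[tau]
    mu_N = (lambda0 * mu0 + N * xbar) / (lambda0 + N)  # fixed across iterations
    for _ in range(n_iters):
        lambda_N = (lambda0 + N) * E_tau               # q(mu) = N(mu_N, 1/lambda_N)
        var_mu = 1.0 / lambda_N
        a_N = a0 + (N + 1) / 2                         # q(tau) = Gamma(a_N, b_N)
        b_N = b0 + 0.5 * (np.sum((x - mu_N) ** 2) + N * var_mu
                          + lambda0 * ((mu_N - mu0) ** 2 + var_mu))
        E_tau = a_N / b_N
    return mu_N, lambda_N, a_N, b_N
```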
16
Q

Name one challenge with mean field coordinate ascent?

A

It is inefficient: each update sweep has to look at the full dataset before the global parameters change.

17
Q

What is the natural gradient?

A

F(v)^-1 grad_v L(v), where F is the Fisher information matrix of q(.|v) and L is the objective (e.g. the ELBO).

It points in the direction of largest change in distribution space rather than parameter space, and is used to maximize the ELBO / minimize the KL divergence.
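A tiny illustration, assuming q = N(mu, sigma^2) with v = (mu, sigma), whose Fisher matrix is diag(1/sigma^2, 2/sigma^2); the objective gradient g is supplied by the caller:

```python
import numpy as np

def natural_gradient(grad, sigma):
    """F^-1 g for q = N(mu, sigma^2), v = (mu, sigma).

    Fisher information F = diag(1/sigma^2, 2/sigma^2), so F^-1 simply
    rescales each coordinate of the ordinary gradient g = (dL/dmu, dL/dsigma).
    """
    F_inv = np.diag([sigma**2, sigma**2 / 2.0])
    return F_inv @ grad
```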

18
Q

Why would we want to use the natural gradient?

A
  • Invariant to parametrization
  • Scales each parameter individually
  • Easy to compute for conditionally conjugate models
  • Unbiased
  • Only depends on parameters of a single datapoint, cheap to compute.
19
Q

What is the algorithm for stochastic VI optimization?

A
  1. Input model p(B, z, X)
  2. Initialize global parameters Lambda at random
  3. Repeat:
    - Sample a datapoint x_n uniformly
    - Update its local parameter phi_n
    - Update the global parameters by stepping a fraction toward the new optimal global estimate
20
Q

What is the ELBO?

A

ELBO(v) = E_q[log p(x, z) - log q(z; v)], a lower bound on log p(x)
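A minimal Monte Carlo estimate of the ELBO for a toy model p(z) = N(0, 1), p(x|z) = N(z, 1) with q(z) = N(m, s^2); the model and all names are assumptions for illustration:

```python
import numpy as np
from scipy.stats import norm

def elbo_estimate(x_obs, m, s, n_samples=10000, seed=0):
    """E_q[log p(x, z) - log q(z; v)] for p(z) = N(0, 1), p(x|z) = N(z, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(m, s, n_samples)                              # z ~ q(z; v)
    log_joint = norm.logpdf(z, 0, 1) + norm.logpdf(x_obs, z, 1)  # log p(x, z)
    return np.mean(log_joint - norm.logpdf(z, m, s))
```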

21
Q

What is usually the limitation of classical VI, and how can this be solved?

A

We can't calculate the expectation in the ELBO for non-conjugate models. This can be solved by:

  1. Lower bounding (model specific)
  2. Using BBVI
22
Q

Why do we want to switch the order of differentiation and expectation in BBVI?

A

If we can't solve the expectation analytically, we can estimate it with Monte Carlo sampling. But propagating the gradient through the sampling is not possible, so we want to differentiate before sampling.

23
Q

Describe the BBVI algorithm using score function.

A

1. Input model p(z, X) and VI approximation q(z|v)
2. Repeat:
- Draw S samples z_s from q
- Update v using the stochastic gradient:
(1/S) sum_s grad_v log q(z_s|v) g(z_s, v), where g(z, v) = log p(z, X) - log q(z|v)
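A minimal sketch for a Gaussian q(z|v) = N(m, s^2) with v = (m, s); log_joint must accept a vector of samples, and all names are illustrative:

```python
import numpy as np
from scipy.stats import norm

def score_function_gradient(log_joint, m, s, n_samples=1000, seed=0):
    """Score-function estimate of grad_v ELBO for q(z|v) = N(m, s^2), v = (m, s)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(m, s, n_samples)              # z_s ~ q(z|v)
    g = log_joint(z) - norm.logpdf(z, m, s)      # g(z, v) = log p(z, X) - log q(z|v)
    score_m = (z - m) / s**2                     # grad_m log q(z|v)
    score_s = ((z - m) ** 2 - s**2) / s**3       # grad_s log q(z|v)
    return np.array([np.mean(score_m * g), np.mean(score_s * g)])
```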

24
Q

In pathwise gradients, what is the reparametrization trick?

A

Let z = t(e, v), with e drawn from a simple, fixed base distribution (e.g. z = mu + sigma * e with e ~ N(0, 1)). This lets us take advantage of
grad_v E_q[f(z)] = E_p(e)[grad_v f(t(e, v))]

25
Q

What are the advantages/disadvantages of the pathwise gradient over the score function?

A

The advantage is that the variance is generally lower, so convergence is quicker. The disadvantages are that z has to be expressible as a (differentiable) transformation z = t(e, v) and that the model has to be differentiable.

26
Q

Describe the BBVI algorithm using pathwise gradient function.

A

1. Input model p(z, X) and VI approximation q(z|v)
2. Repeat:
- Draw samples e_s from the base distribution and set z_s = t(e_s, v)
- Update v using the stochastic gradient:
(1/S) sum_s grad_z g(z_s, v) grad_v t(e_s, v)
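A minimal sketch for q = N(m, s^2) with t(e, v) = m + s * e; it estimates only the model term of the ELBO gradient (the entropy term is omitted), and grad_log_joint and the names are assumptions:

```python
import numpy as np

def pathwise_gradient(grad_log_joint, m, s, n_samples=1000, seed=0):
    """Pathwise estimate of grad_v E_q[log p(z, X)] for q = N(m, s^2), v = (m, s)."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(n_samples)   # e ~ N(0, 1), the base distribution
    z = m + s * e                        # z = t(e, v)
    g = grad_log_joint(z)                # grad_z of the integrand at the samples
    return np.array([np.mean(g),         # chain rule: grad_m t(e, v) = 1
                     np.mean(g * e)])    # grad_s t(e, v) = e
```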

27
Q

What challenge does amortized inference address?

A

SVI has to optimize the local parameters phi_i separately for every datapoint. Amortized inference addresses this by learning a single mapping, with parameters theta, from a datapoint to its local parameters, so f(x_i; theta) = phi_i; f is usually a (deep) neural network.

28
Q

How can we get richer approximations than mean-field?

A

Use structured mean-field, for example:

  1. An autoregressive distribution, where z_i depends on all previous latent variables.
  2. Normalizing flows: send a simple distribution through a series of invertible transformations.
29
Q

What is the normalizing constant for the clique potentials in an MRF?

A

Z = sum_x prod_c psi_c(x_c)
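A brute-force sketch that enumerates all joint states of a tiny discrete MRF; the chain structure and the agreement potential in the example are assumptions:

```python
import itertools
import numpy as np

def partition_function(potentials, n_vars, n_states=2):
    """Z = sum_x prod_c psi_c(x_c), by enumeration (tiny models only).

    potentials: list of (clique, psi) pairs, where clique is a tuple of variable
    indices and psi maps the clique's state tuple to a non-negative number.
    """
    Z = 0.0
    for x in itertools.product(range(n_states), repeat=n_vars):
        Z += np.prod([psi(tuple(x[i] for i in clique)) for clique, psi in potentials])
    return Z

# Example: a 3-node chain with pairwise potentials favouring agreement
psi = lambda xc: 2.0 if xc[0] == xc[1] else 1.0
print(partition_function([((0, 1), psi), ((1, 2), psi)], n_vars=3))
```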