Variational Inference Flashcards

1
Q

The goal of inference is to learn about latent (unknown) variables through the posterior. Why is an analytic solution usually not an option?

A

The marginal (evidence) integral is usually intractable.

2
Q

Name some options for posterior inference

A

MCMC sampling
Laplace approximation
Expectation propagation
Variational inference

3
Q

What is the main advantage of variational inference?

A

It is the most scalable method currently known.

4
Q

What is the main idea behind variational inference?

A

Approximate the true posterior by defining a family of approximate distributions q_v and optimizing the variational parameters v.

5
Q

What is the KL (Kullback-Leibler) divergence?

A

KL(p(x) || q(x)) =
integ p(x) log(p(x)/q(x)) dx =
E_p[log(p(x)/q(x))]
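As a numerical sketch of this definition (the two Gaussians and all names below are chosen purely for illustration), we can compare a Monte Carlo estimate of E_p[log p(x)/q(x)] against the known closed form for the KL divergence between two 1-D Gaussians:

```python
import math
import random

random.seed(0)

def gauss_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def kl_gauss(mu_p, sig_p, mu_q, sig_q):
    """Closed-form KL(N(mu_p, sig_p^2) || N(mu_q, sig_q^2))."""
    return (math.log(sig_q / sig_p)
            + (sig_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sig_q ** 2)
            - 0.5)

# Monte Carlo estimate of E_p[log p(x) - log q(x)] with p = N(0,1), q = N(1,4)
n = 100_000
mc = sum(gauss_logpdf(x, 0.0, 1.0) - gauss_logpdf(x, 1.0, 2.0)
         for x in (random.gauss(0.0, 1.0) for _ in range(n))) / n
exact = kl_gauss(0.0, 1.0, 1.0, 2.0)
```

Note that KL(p || p) = 0 and that the divergence is not symmetric in p and q.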

6
Q

What is differential entropy?

A

H[q(x)]= -Eq[log q(x)]
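A quick numerical sketch of this definition (the Gaussian example is an assumption, not from the card): estimate H[q] by Monte Carlo for q = N(0, sigma^2) and compare against the known closed form 0.5 * log(2*pi*e*sigma^2):

```python
import math
import random

random.seed(6)

sigma = 1.7

def logq(x):
    """Log density of q = N(0, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - x * x / (2 * sigma ** 2)

# Monte Carlo estimate of H[q] = -E_q[log q(x)]
n = 100_000
h_mc = -sum(logq(random.gauss(0.0, sigma)) for _ in range(n)) / n
h_exact = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
```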

7
Q

How is the KL divergence often used in variational inference?

A

Use KL(q(z) || p(z|x)) as the objective function to minimize.

8
Q

What does Jensen's inequality state, and for what function do we often use this inequality?

A

For concave functions f:
f(E[x]) >= E[f(x)]

This is often used with the logarithm:
log E[x] >= E[log(x)]
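A tiny numerical illustration of the log case (the sample distribution is an arbitrary choice for this sketch):

```python
import math
import random

random.seed(1)

# Draw positive samples and compare the log of the mean with the mean of the logs.
xs = [random.uniform(0.5, 2.0) for _ in range(10_000)]
log_of_mean = math.log(sum(xs) / len(xs))       # log E[x]
mean_of_log = sum(map(math.log, xs)) / len(xs)  # E[log x]
# Jensen's inequality for the concave log: log E[x] >= E[log x]
```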

9
Q

What can we do instead of minimizing KL(q(z) || p(z|x))?

A
We can maximize the ELBO (Evidence Lower Bound):
Eq[log p(x|z)] - KL(q(z) || p(z)) =
Eq[log p(x,z)] + H[q(z|v)] =
Eq[log p(x,z)] - Eq[log q(z|v)] =
Eq[log p(x,z) / q(z|v)]
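A sanity-check sketch of this identity on a toy conjugate model (the model z ~ N(0,1), x|z ~ N(z,1) is an assumption of the sketch): when q equals the exact posterior N(x/2, 1/2), the ELBO equals the log evidence log N(x; 0, 2):

```python
import math
import random

random.seed(7)

def log_norm(t, mu, var):
    """Log density of N(mu, var) at t."""
    return -0.5 * math.log(2 * math.pi * var) - (t - mu) ** 2 / (2 * var)

x = 0.7
n = 1_000
elbo = 0.0
for _ in range(n):
    z = random.gauss(x / 2, math.sqrt(0.5))   # z ~ q(z), the exact posterior
    elbo += (log_norm(z, 0.0, 1.0)            # log p(z)
             + log_norm(x, z, 1.0)            # + log p(x|z)
             - log_norm(z, x / 2, 0.5))       # - log q(z)
elbo /= n
log_evidence = log_norm(x, 0.0, 2.0)
# When q is the exact posterior, p(x,z)/q(z) = p(x) for every z, so the
# Monte Carlo ELBO matches log p(x) with essentially zero variance.
```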
10
Q

What is a mean field approximation?

A

In a mean field approximation q(z|x) is fully factorized, meaning q(z|x) = prod_i q(z_i). With global parameters beta and local parameters z_n, the resulting distribution is:
q(beta, z | v) = q(beta | lambda) prod_n q(z_n | phi_n), with
v = [lambda, phi_1, phi_2, ..., phi_n]

11
Q

In the mean field approximation the q-factors don’t depend directly on the data. How is the family of q’s connected to the data?

A

Through the maximization of the ELBO.

12
Q

What is the algorithm for mean field approximation?

A
  1. Initialize parameters randomly
  2. Update local variational parameters
  3. Update global variational parameters
  4. Repeat until convergence.
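The coordinate-ascent loop above can be sketched on a classic toy target: a correlated 2-D Gaussian approximated by a factorized q(z1)q(z2). For a Gaussian target the coordinate updates have a standard closed form; the specific numbers below are assumptions of the sketch:

```python
# CAVI on p(z) = N(mu, Sigma) with q(z) = q(z1) q(z2).
mu = [1.0, -1.0]
rho = 0.8                                    # correlation of the target
d = 1.0 - rho * rho
P11, P22, P12 = 1.0 / d, 1.0 / d, -rho / d   # precision matrix entries

m1, m2 = 0.0, 0.0                            # 1. initialize variational means
for _ in range(50):                          # 4. repeat the updates
    m1 = mu[0] - (P12 / P11) * (m2 - mu[1])  # 2./3. closed-form coordinate updates
    m2 = mu[1] - (P12 / P22) * (m1 - mu[0])

# The means converge to the true means, but each factor's variance 1/Pii
# is smaller than the true marginal variance 1.0 (mean field is "too compact").
```

The last comment previews the limitation in the next card: the factorized q matches the posterior means but understates its spread.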
13
Q

What are the limitations of mean field approximation?

A

Mean field approximations tend to be too compact (they underestimate the posterior variance), so a richer approximating class is often needed.

14
Q

Classical mean field approximation has to evaluate all data points to update the parameters, making it unscalable to large datasets. How can we alleviate this problem?

A

Use stochastic variational inference, updating the parameters with a stochastic subset of the data.

15
Q

How do we maximize the ELBO?

A

Set each factor proportional to the exponentiated expected log joint under the other factors:
q*(z_i) prop-to exp(E_{-i}[log p(x,z)])

16
Q

What is an important criterion for using mean field approximation with stochastic (noisy) gradients?

A

The gradients should be unbiased

17
Q

What is a problem with maximizing the ELBO?

A

The ELBO is non-convex, so we only find local optima.

18
Q

What is the natural gradient?

A

The gradient premultiplied by the inverse Fisher information matrix of q(z|v). It measures change in the distribution itself (for example in terms of KL divergence) rather than in the raw parameters v.

19
Q

What are some advantages of the natural gradient?

A
  • Invariant to reparametrization, for example variance vs. precision.
  • Combined with stochastic updates, a single data point per step suffices.
20
Q

How do we update the global parameter using noisy gradients?

A

Use a running average:

lambda_t = (1 - rho_t) * lambda_{t-1} + rho_t * lambda_hat_t

where lambda_hat_t is the noisy estimate from the sampled data point and rho_t is a decreasing step size.
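A minimal sketch of such a Robbins-Monro running update (the target value and the step-size schedule are assumptions chosen for illustration):

```python
import random

random.seed(2)

lam = 0.0            # global variational parameter lambda
target = 5.0         # the value a full-data update would converge to
for t in range(1, 2001):
    lam_hat = target + random.gauss(0.0, 1.0)  # noisy one-data-point estimate
    rho = t ** -0.6                             # decaying step size
    lam = (1 - rho) * lam + rho * lam_hat       # running-average update
# lam converges to the target despite every individual estimate being noisy
```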

21
Q

What happens if we try to use CAVI to do Bayesian logistic regression?

A

We get an expectation (mean) that can't be calculated in closed form.

22
Q

Why can't we use Monte Carlo approximation to calculate the intractable mean when doing inference on logistic regression?

A

We can't push gradients through Monte Carlo sampling. (We can optimize a further lower bound instead, but that bound is model specific.)

23
Q

Why do we have to swap the order of integration (for ELBO expectations) and differentiation in BBVI (Black Box VI)?

A

The integrals are intractable for non-conjugate models, which makes direct gradient computation infeasible.

24
Q

What is the idea behind score function gradients?

A

Switch the order of integration and differentiation, then rewrite the gradient as an expectation under q that can be estimated by sampling.

25
Q

How can we practically calculate the score function gradient?

A

With a Monte Carlo estimate: sample z ~ q(z|v) and average.

26
Q

What do we need to calculate the score function gradients?

A
  1. Sampling from q
  2. Evaluating the score function grad_v log q(z|v)
  3. Evaluating log q(z|v) and log p(x,z)
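The three steps can be sketched on a toy objective (the choices q = N(v, 1) and f(z) = z^2, standing in for the instantaneous ELBO, are assumptions of the sketch); the true gradient of E_q[z^2] = v^2 + 1 is 2v:

```python
import random

random.seed(3)

v = 1.5
n = 200_000
grad = 0.0
for _ in range(n):
    z = random.gauss(v, 1.0)  # 1. sample z ~ q(z|v) = N(v, 1)
    score = z - v             # 2. grad_v log q(z|v) for a unit-variance Gaussian
    f = z * z                 # 3. evaluate the integrand (toy stand-in for the ELBO term)
    grad += f * score
grad /= n
# grad is a (noisy) Monte Carlo estimate of d/dv E_q[z^2] = 2v
```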
27
Q

What is a problem with the score function gradients?

A

As they use MC sampling they are noisy (high variance). This can be alleviated by, for example, control variates.

28
Q

If f: X -> Y with y = f(x), how are the volume elements dx and dy (and hence integrals over X and Y) related?

A

dy = |det(df/dx)| dx, so integ_Y dy = integ_X |det(df/dx)| dx. The "area" is multiplied by the absolute value of the determinant of the Jacobian.
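A concrete check for a linear map (the matrix is an arbitrary example): mapping the unit square through y = A x scales its area by |det A|:

```python
# Map the unit square's corners through the linear map y = A x and compute
# the image area with the shoelace formula; it equals |det A|.
A = [[2.0, 1.0],
     [0.0, 3.0]]
corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
image = [(A[0][0] * x + A[0][1] * y, A[1][0] * x + A[1][1] * y)
         for x, y in corners]
area = 0.5 * abs(sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(image, image[1:] + image[:1])))
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]
```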

29
Q

What are the properties of pathwise gradient compared to score function gradient?

A

Lower variance, but restricted to a smaller model class (differentiable models where z can be reparameterized as z = t(e, v)).

30
Q

What is the score function ELBO gradient, and what is the pathwise ELBO gradient?

A

Score:
Eq[ g(z,v) * grad_v log q(z|v) ]
Pathwise:
Ep(e)[ grad_z g(z,v) * grad_v t(e,v) ]

where g(z,v) = log p(x,z) - log q(z|v) is the instantaneous ELBO and the pathwise expectation is over the noise distribution p(e), with z = t(e,v).
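As a toy check of the pathwise form, take g(z,v) = z^2 and q = N(v, 1) (both assumptions of this sketch), reparameterized as z = t(e, v) = v + e with e ~ N(0, 1), so grad_z g * grad_v t = 2z * 1; the true gradient of E_q[z^2] = v^2 + 1 is 2v:

```python
import random

random.seed(4)

v = 1.5
n = 10_000
grad = 0.0
for _ in range(n):
    e = random.gauss(0.0, 1.0)  # sample the noise distribution, not q itself
    z = v + e                   # z = t(e, v)
    grad += 2.0 * z * 1.0       # grad_z g(z,v) * dt/dv
grad /= n
# For this problem the pathwise estimator needs far fewer samples than the
# score-function estimator to reach the same accuracy (lower variance).
```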

31
Q

What is the idea behind amortized inference?

A

Learn a mapping phi_n = f(x_n, theta) from data points to local parameters. This means:

  1. We do not need to learn any local parameters per data point
  2. No more independent updates of local and global parameters
  3. The new theta can be found using SGD
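A minimal sketch, assuming a toy conjugate model z_n ~ N(0,1), x_n | z_n ~ N(z_n, 1) whose exact posterior mean is x_n / 2, and a hypothetical linear inference map phi_n = theta * x_n trained by SGD on the ELBO:

```python
import random

random.seed(5)

data = [random.gauss(0.0, 1.5) for _ in range(500)]

theta = 0.0
for step in range(5_000):
    x = random.choice(data)   # one data point per step (SGD)
    # For q(z) = N(theta * x, 1/2) in this model, d ELBO / d theta is
    # (x - 2 * theta * x) * x, which vanishes at theta = 1/2.
    theta += 0.01 * (x - 2.0 * theta * x) * x
# theta approaches 1/2: one shared map recovers every local posterior mean x/2
```

One learned parameter theta replaces all the per-datapoint phi_n, which is the whole point of amortization.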
32
Q

What is the idea of autoregressive distributions?

A

We make z_i dependent on all former z_j with j < i.

33
Q

What is the idea of normalizing flows?

A

We apply K invertible transformations to q(z|v), tracking the density with the change-of-variables formula.
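A single-transformation sketch (the affine map and its parameters are assumptions of the sketch): push the base density q0 = N(0,1) through an invertible map t(z) = a*z + b and track the density with the change-of-variables formula; the result matches the known pushforward N(b, a^2):

```python
import math

# log q1(y) = log q0(t^{-1}(y)) - log |dt/dz|  for the affine flow t(z) = a*z + b
def log_n01(z):
    """Log density of the base N(0, 1)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * z * z

a, b = 2.0, 1.0
y = 0.3
z = (y - b) / a                          # invert the flow
log_q1 = log_n01(z) - math.log(abs(a))   # flow density at y
# The pushforward of N(0,1) through t is N(b, a^2); compare directly:
log_exact = -0.5 * math.log(2.0 * math.pi * a * a) - (y - b) ** 2 / (2.0 * a * a)
```

Stacking K such invertible maps just adds one log-Jacobian term per layer.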