Variational Inference Flashcards
The goal of inference is to learn about latent (unknown) variables through the posterior. Why is an analytic solution usually not an option?
The marginal likelihood (evidence) integral, p(x) = integ p(x, z) dz, is usually intractable.
Name some options for posterior inference
MCMC sampling
Laplace approximation
Expectation propagation
Variational inference
What is the main advantage of variational inference?
It is the most scalable of these methods: it turns inference into an optimization problem, so it can handle large datasets and models.
What is the main idea behind variational inference?
Approximate the true posterior by defining a family of approximate distributions q_v and optimizing the variational parameters v so that q_v is close to the posterior.
What is the KL (Kullback Leibler) divergence?
KL(p(x) || q(x)) =
integ p(x) log(p(x)/q(x)) dx =
E_p[log(p(x)/q(x))]
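A minimal numeric sketch, assuming two arbitrary univariate Gaussians chosen only for illustration: the KL divergence is estimated by averaging log p(x) - log q(x) over samples from p and compared with the closed-form Gaussian KL.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative Gaussians: p = N(0, 1), q = N(1, 2^2)
mu_p, sd_p = 0.0, 1.0
mu_q, sd_q = 1.0, 2.0

def log_normal(x, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mu)**2 / (2 * sd**2)

# Monte Carlo estimate of KL(p || q) = E_p[log p(x) - log q(x)]
x = rng.normal(mu_p, sd_p, size=100_000)
kl_mc = np.mean(log_normal(x, mu_p, sd_p) - log_normal(x, mu_q, sd_q))

# Closed form for two Gaussians, as a sanity check
kl_exact = np.log(sd_q / sd_p) + (sd_p**2 + (mu_p - mu_q)**2) / (2 * sd_q**2) - 0.5

print(kl_mc, kl_exact)  # both around 0.44
```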
What is differential entropy?
H[q(x)] = -E_q[log q(x)]
How is the KL divergence often used in variational inference?
Use KL(q(z) || p(z|x)) as the objective function to minimize.
What does Jensen's inequality state, and for which function do we often use it?
For concave functions f:
f(E[x]) >= E[f(x)]
This is often used for the logarithm, which is concave:
log E[x] >= E[log(x)]
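A two-line numeric check of the logarithm case, using an arbitrary positive random variable (a log-normal, my choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Jensen for the (concave) logarithm: log E[x] >= E[log x]
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # any positive random variable works
print(np.log(x.mean()), np.log(x).mean())              # ~0.5 vs ~0.0: the left side is larger
```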
What can we do instead of minimizing KL(q(z) || p(z|x))?
We can maximize the ELBO (Evidence Lower BOund), since log p(x) = ELBO + KL(q(z) || p(z|x)):
ELBO = E_q[log p(x|z)] - KL(q(z) || p(z))
= E_q[log p(x,z)] + H[q(z|v)]
= E_q[log p(x,z)] - E_q[log q(z|v)]
= E_q[log (p(x,z) / q(z|v))]
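A minimal sketch of the last identity, assuming a toy conjugate model of my choosing (z ~ N(0,1), x | z ~ N(z,1)) so that log p(x) is available in closed form: the ELBO is estimated as the average of log p(x,z) - log q(z|v) over samples from q, and it stays below the true evidence.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mu)**2 / (2 * sd**2)

# Toy model: z ~ N(0, 1), x | z ~ N(z, 1), with a single observation x = 2
x_obs = 2.0

# Variational family q(z | v) = N(m, s^2); some arbitrary setting of v = (m, s)
m, s = 0.8, 0.6

# ELBO = E_q[log p(x, z) - log q(z | v)], estimated by Monte Carlo
z = rng.normal(m, s, size=100_000)
log_joint = log_normal(z, 0.0, 1.0) + log_normal(x_obs, z, 1.0)
log_q = log_normal(z, m, s)
elbo = np.mean(log_joint - log_q)

# The evidence log p(x) = log N(x; 0, sqrt(2)) is available here as a check
log_evidence = log_normal(x_obs, 0.0, np.sqrt(2.0))

print(elbo, log_evidence)  # the ELBO stays below log p(x)
```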
What is a mean field approximation?
In a mean field approximation q is fully factorized, meaning q(z) = prod_i q(z_i). With global parameters beta and local latent variables z_n, the resulting distribution is
q(beta, z | v) = q(beta | lambda) prod_n q(z_n | phi_n), with
v = [lambda, phi_1, phi_2, ..., phi_n]
In the mean field approximation the q-factors don’t depend directly on the data. How is the family of q’s connected to the data?
Through the maximization of the ELBO.
What is the algorithm for mean field approximation?
- Initialize the variational parameters randomly
- Update the local variational parameters
- Update the global variational parameters
- Repeat until the ELBO converges (sketched below).
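A minimal sketch of this loop on a toy target of my choosing (a correlated bivariate Gaussian, a standard textbook example with no local/global split): with q(z) = q(z1) q(z2), each coordinate update sets a factor's mean from the current estimate of the other coordinate. The iteration recovers the true means but underestimates the marginal variances.

```python
import numpy as np

# Toy target: p(z) = N(mu, Sigma) with correlated components
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
Lam = np.linalg.inv(Sigma)             # precision matrix

# Mean-field family: q(z) = q(z1) q(z2), each factor Gaussian
m = np.zeros(2)                        # factor means, initialized arbitrarily
v = 1.0 / np.diag(Lam)                 # optimal factor variances are fixed: 1 / Lam_ii

for _ in range(50):                    # coordinate ascent (CAVI) sweeps
    # Update q(z1) holding q(z2) fixed, then q(z2) holding q(z1) fixed
    m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])
    m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])

print("mean-field means:    ", m, "  true means:    ", mu)
print("mean-field variances:", v, "  true marginals:", np.diag(Sigma))
```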
What are the limitations of mean field approximation?
Mean field approximations tend to be too compact (they typically underestimate the posterior variance); a richer class of approximations is needed.
Classical mean field (CAVI) has to sweep over all datapoints to update the parameters, making it unscalable to large datasets. How can we alleviate this problem?
Use stochastic variational inference (SVI), updating the parameters with a random subset (minibatch) of the data.
How do we maximize the ELBO?
Set each factor to its coordinate-wise optimum, q*(z_i) proportional to exp(E_{-i}[log p(x, z)]), where E_{-i} is the expectation over all factors except q(z_i).
What is an important criterion for using mean field approximation with stochastic (noisy) gradients?
The gradients should be unbiased (equal to the true gradient in expectation).
What is a problem with maximizing the ELBO?
The ELBO is non-convex, so we are only guaranteed to find a local optimum.
What is the natural gradient?
The ordinary gradient rescaled by the inverse Fisher information of q(.|v), so that step sizes are measured by how much the distribution changes (for example in symmetrized KL divergence) rather than by how much the parameter vector v changes.
What are some advantages of the natural gradient?
- Invariant to parametrization, for example variance vs precision.
- In SVI, a noisy natural gradient can be computed cheaply from a single datapoint (or minibatch).
How do we update the global parameter using noisy gradients?
Use a running average of the noisy estimates with a decreasing step size rho_t:
lambda_t = (1 - rho_t) lambda_{t-1} + rho_t lambda_hat_t
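A minimal sketch of this update, assuming a toy conjugate model of my choosing (a Gaussian mean with known unit variance and no local latent variables) so that the noisy estimate lambda_hat has a simple closed form: each step looks at one datapoint, forms lambda_hat as if that point were repeated N times, and blends it in with a decaying step size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model: mu ~ N(0, 1), x_i | mu ~ N(mu, 1)
N = 1000
x = rng.normal(2.0, 1.0, size=N)

# Global factor q(mu) = N(m, s^2) stored as natural parameters
# (eta1, eta2) = (m / s^2, -1 / (2 s^2)); the prior corresponds to (0, -0.5)
lam = np.array([0.0, -0.5])

for t in range(1, 5001):
    i = rng.integers(N)                        # sample one datapoint
    # Noisy estimate: prior + N * (sufficient statistics of the sampled point)
    lam_hat = np.array([0.0, -0.5]) + N * np.array([x[i], -0.5])
    rho = (t + 10.0) ** -0.7                   # decreasing step size (Robbins-Monro)
    lam = (1.0 - rho) * lam + rho * lam_hat    # running average of noisy estimates

# Recover mean/variance of q and compare with the exact posterior
s2 = -1.0 / (2.0 * lam[1])
m = lam[0] * s2
print("SVI:  ", m, s2)
print("exact:", x.sum() / (N + 1), 1.0 / (N + 1))   # approximately recovered
```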
What happens if we try to use CAVI to do Bayesian logistic regression?
We get expectations (means) that cannot be calculated in closed form, because the model is not conditionally conjugate.
Why can't we use a Monte Carlo approximation to calculate the intractable expectations when doing inference for logistic regression?
We can't push gradients through Monte Carlo sampling. (We can optimize a further lower bound instead, but such bounds are model-specific.)
Why do we have to swap the order of integration (the ELBO expectations) and differentiation in BBVI (black-box VI)?
The expectations are intractable for non-conjugate models, so we cannot evaluate the ELBO in closed form and then differentiate it; we have to differentiate first and estimate the resulting expectation.
What is the idea behind score function gradients?
Swap the order of integration and differentiation, use the identity grad_v q(z|v) = q(z|v) grad_v log q(z|v), and write the gradient as an expectation under q.
How can we practically calculate the score function gradient?
With a Monte Carlo (MC) estimate, using samples from q.
What do we need to calculate the score function gradients?
- Sampling from q
- Evaluating the score function grad_v log q(z|v)
- Evaluating log q(z|v) and log p(x, z)
What is a problem with the score function gradients?
Because they rely on MC sampling they are noisy (high variance). This can be alleviated by, for example, control variates.
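A minimal sketch, using a toy objective of my choosing rather than a full model: the gradient of E_{q(z|m)}[z^2] with q = N(m, 1) is known to be 2m, so the score function estimator and the variance reduction from a simple control variate (a constant baseline) can be checked directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: grad_m E_{z ~ N(m, 1)}[z^2]; the true gradient is 2m
m = 3.0
z = rng.normal(m, 1.0, size=50_000)

f = z**2
score = z - m                          # grad_m log N(z | m, 1)

plain = f * score                      # raw score function (REINFORCE) samples
baseline = m**2 + 1.0                  # constant baseline: here the known E_q[f(z)]
with_cv = (f - baseline) * score       # same expectation, lower variance

print("true gradient: ", 2 * m)
print("score estimate:", plain.mean(), " variance:", plain.var())
print("with baseline: ", with_cv.mean(), " variance:", with_cv.var())
```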
If f: X -> Y is invertible with y = f(x), how are the volume elements dx and dy related?
dy = |det(df/dx)| dx: the "volume" is multiplied by the absolute value of the determinant of the Jacobian.
What are the properties of the pathwise gradient compared to the score function gradient?
Lower variance, but a more restricted class of models: the model must be differentiable in z, and z must be reparameterizable as z = t(eps, v) with eps drawn from a fixed distribution.
What are the score function ELBO gradient and the pathwise ELBO gradient?
Score function:
grad_v ELBO = E_q[g(z, v) grad_v log q(z|v)]
Pathwise:
grad_v ELBO = E_{p(eps)}[grad_z g(z, v) grad_v t(eps, v)]
where g(z, v) = log p(x, z) - log q(z|v).
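A minimal sketch on the same toy objective as above (my choice): writing z = t(eps, v) = m + eps with eps ~ N(0, 1) lets the gradient flow through the sample, and the resulting estimator has much lower variance than the score function one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy objective: grad_m E_{z ~ N(m, 1)}[z^2]; the true gradient is 2m
m = 3.0
eps = rng.normal(size=50_000)

# Pathwise / reparameterization: z = t(eps, m) = m + eps, so
# grad_m f(t(eps, m)) = f'(z) * dt/dm = 2 z * 1
z = m + eps
pathwise = 2.0 * z

# Score function estimator on the same draws, for comparison
score = (z**2) * (z - m)

print("true gradient:", 2 * m)
print("pathwise:", pathwise.mean(), " variance:", pathwise.var())
print("score:   ", score.mean(), " variance:", score.var())
```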
What is the idea behind amortized inference?
Learn a mapping f(x_n; theta) = phi_n from datapoints to local variational parameters. This means:
- We do not need to learn separate local parameters for each datapoint
- There are no more alternating, independent updates of local and global parameters
- The shared theta can be found using SGD (see the sketch below)
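A minimal sketch, assuming a toy linear-Gaussian model of my choosing (z_n ~ N(0,1), x_n | z_n ~ N(z_n, 1), exact posterior N(x_n/2, 1/2)): the amortized family q(z_n | x_n) = N(a*x_n + b, exp(c)^2) shares theta = (a, b, c) across all datapoints and is fit with SGD on reparameterized ELBO gradients (using the analytic Gaussian entropy).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: z_n ~ N(0, 1), x_n | z_n ~ N(z_n, 1); the exact posterior per
# datapoint is N(x_n / 2, 1/2), which the amortized family below can represent
N = 2000
x = rng.normal(0.0, np.sqrt(2.0), size=N)   # the marginal of x is N(0, 2)

# Amortized family: q(z_n | x_n) = N(a * x_n + b, exp(c)^2), shared theta = (a, b, c)
a, b, c = 0.0, 0.0, 0.0
lr = 0.01

for step in range(3000):
    xb = x[rng.integers(N, size=64)]        # minibatch of datapoints
    mu, sd = a * xb + b, np.exp(c)
    eps = rng.normal(size=xb.shape)
    z = mu + sd * eps                       # reparameterized samples

    # d/dz log p(x, z) = (x - z) - z; the Gaussian entropy adds 1/sd to grad_sd
    dlogp_dz = (xb - z) - z
    grad_mu = dlogp_dz
    grad_sd = dlogp_dz * eps + 1.0 / sd

    # Chain rule into the shared parameters, averaged over the minibatch
    a += lr * np.mean(grad_mu * xb)
    b += lr * np.mean(grad_mu)
    c += lr * np.mean(grad_sd * sd)

print("learned:", a, b, np.exp(c))          # approx 0.5, 0.0, sqrt(0.5) ~ 0.707
```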
What is the idea of autoregressive distributions?
We make z_i depend on all previous z_j with j < i, i.e. q(z) = prod_i q(z_i | z_1, ..., z_{i-1}).
What is the idea of normalizing flows?
We apply K invertible transformations to a simple base distribution q(z|v), tracking the change in density through the Jacobian determinant of each transformation.
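A minimal 1-D sketch with planar-flow-style layers of my own choosing: samples from a simple base distribution are pushed through K invertible maps while the log-density is updated with the change-of-variables log-determinant (the Jacobian card above); the resulting log q_K values are exactly what the entropy term of the ELBO needs.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(z):
    return -0.5 * np.log(2 * np.pi) - 0.5 * z**2

# Base distribution q0 = N(0, 1); K = 3 invertible 1-D maps z -> z + u * tanh(w z + b)
# (planar-flow-style layers; invertible here because 1 + u * w * (1 - tanh^2) > 0)
layers = [(0.5, 1.0, 0.0), (-0.4, 2.0, 1.0), (0.7, 0.5, -1.0)]   # (u, w, b) per layer

z = rng.normal(size=100_000)      # samples z_0 from the base distribution
log_q = log_normal(z)             # log q_0(z_0)

for u, w, b in layers:
    h = np.tanh(w * z + b)
    dh = w * (1.0 - h**2)                    # derivative of tanh(w z + b)
    log_q -= np.log(np.abs(1.0 + u * dh))    # subtract the log |det Jacobian| term
    z = z + u * h                            # apply the flow layer

# log_q now evaluates log q_K at the transformed samples z_K, which is the term
# needed for the entropy part of the ELBO when q_K is the variational distribution
print(z[:3], log_q[:3])
```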