6_Variational Inference Flashcards
ELBO (evidence lower bound; marginal likelihood lower bound)
F(v) := E_q[log( p(x,z) / q(z|v) )]
F(v) = E_q[log( p(x|z) )] - KL(q(z|v) || p(z))
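A minimal Monte Carlo sketch of this definition (assuming a toy model p(z) = N(0,1), p(x|z) = N(z,1) and a Gaussian approximation q(z|v) = N(m, s^2); the model and function name are illustrative, not part of the card):

```python
import numpy as np
from scipy.stats import norm

def elbo_mc(x, m, s, n_samples=1000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of F(v) = E_q[log p(x,z) - log q(z|v)]
    for p(z) = N(0,1), p(x|z) = N(z,1), q(z|v) = N(m, s^2)."""
    z = rng.normal(m, s, size=n_samples)                      # z ~ q(z|v)
    log_p_xz = norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1)    # log p(z) + log p(x|z)
    log_q = norm.logpdf(z, m, s)                              # log q(z|v)
    return np.mean(log_p_xz - log_q)

print(elbo_mc(x=1.0, m=0.5, s=0.8))
```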
Score Function
score function (log derivative trick)
\nabla_v log(q(z|v)) = \nabla_v q(z|v) / q(z|v)
\Leftrightarrow \nabla_v q(z|v) = \nabla_v log(q(z|v)) * q(z|v)
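Worked example for a Gaussian approximation q(z|v) = N(z; m, \sigma^2) with v = (m, \sigma) (not on the original card):
\nabla_m log(q(z|v)) = (z - m) / \sigma^2
\nabla_\sigma log(q(z|v)) = (z - m)^2 / \sigma^3 - 1/\sigma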
Natural Gradient
\tilde{\nabla}_v F(v) = \mathcal{F}(v)^{-1} \nabla_v F(v)
\mathcal{F}(v) = Fisher information matrix of q(z|v)
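Worked example (not on the original card): for q(z|v) = N(z; m, \sigma^2) with \sigma fixed, the Fisher information for m is 1/\sigma^2, so the natural gradient simply rescales the ordinary gradient:
\tilde{\nabla}_m F(v) = \sigma^2 * \nabla_m F(v)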
Noisy Updates of Variational Parameters
v_{t+1} = v_t + \rho_t \hat{\nabla}_v F(v)
\nabla_v ELBO using the score function
\nabla_v F(v) = E_q[\nabla_v log(q(z|v)) * ( log(p(x,z)) - log(q(z|v)) )]
Use Monte Carlo to compute this
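A minimal sketch of the resulting Monte Carlo estimator (same toy model and Gaussian q(z|v) = N(m, s^2) as in the ELBO sketch above; names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def score_grad_elbo(x, m, s, n_samples=5000, rng=np.random.default_rng(0)):
    """MC estimate of grad_v F(v) = E_q[grad_v log q(z|v) * (log p(x,z) - log q(z|v))]
    for p(z) = N(0,1), p(x|z) = N(z,1), q(z|v) = N(m, s^2), v = (m, s)."""
    z = rng.normal(m, s, size=n_samples)                      # z ~ q(z|v)
    f = norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1) - norm.logpdf(z, m, s)
    score_m = (z - m) / s**2                                  # grad_m log q(z|v)
    score_s = (z - m) ** 2 / s**3 - 1.0 / s                   # grad_s log q(z|v)
    return np.mean(score_m * f), np.mean(score_s * f)

print(score_grad_elbo(x=1.0, m=0.5, s=0.8))
```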
Change of variables
|q(z|v) dz| = |p(\varepsilon) d\varepsilon|
Reparametrisation trick
Base distribution p(\varepsilon) [normal or uniform] and a deterministic transformation z = t(\varepsilon, v) s.t. z~q(z|v). Then:
\nabla_v E_{q(z|v)}[f(z)] = E_{p(\varepsilon)}[\nabla_v f(t(\varepsilon, v))]
Note that the expectation is now taken w.r.t. the base distribution p(\varepsilon).
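Standard Gaussian case (a worked example, assuming q(z|v) = N(z; m, \sigma^2)):
\varepsilon ~ N(0, 1), z = t(\varepsilon, v) = m + \sigma \varepsilon ~ N(m, \sigma^2)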
Reparametrisation ELBO gradient
\nabla_v F(v) = E_{p(\varepsilon)}[ \nabla_v ( log(p(x, t(\varepsilon, v))) - log(q(t(\varepsilon, v) | v)) ) ]
\nabla_v F(v) = E_{p(\varepsilon)}[ \nabla_z ( log(p(x,z)) - log(q(z|v)) ) * \nabla_v t(\varepsilon, v) ]
where z = t(\varepsilon, v)
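A minimal pathwise sketch for the same toy model, using z = m + s * \varepsilon and the closed-form \nabla_z terms (illustrative, not from the card):

```python
import numpy as np

def pathwise_grad_elbo(x, m, s, n_samples=1000, rng=np.random.default_rng(0)):
    """MC estimate of grad_v F(v) via the reparametrisation z = m + s * eps,
    for p(z) = N(0,1), p(x|z) = N(z,1), q(z|v) = N(m, s^2), v = (m, s)."""
    eps = rng.normal(size=n_samples)          # eps ~ p(eps) = N(0, 1)
    z = m + s * eps                           # z = t(eps, v) ~ q(z|v)
    dz_log_p = -z + (x - z)                   # grad_z [log p(z) + log p(x|z)]
    dz_log_q = -(z - m) / s**2                # grad_z log q(z|v)
    inner = dz_log_p - dz_log_q               # grad_z (log p(x,z) - log q(z|v))
    grad_m = np.mean(inner)                   # dt/dm = 1
    grad_s = np.mean(inner * eps)             # dt/ds = eps
    return grad_m, grad_s

print(pathwise_grad_elbo(x=1.0, m=0.5, s=0.8))
```

Compared with the score-function estimator above, this typically needs far fewer samples for the same accuracy.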
What are the score function ELBO gradient properties
+ Works for all models (continuous and discrete)
+ Works for a large class of variational approximations
- Variance can be high, thus slow convergence
What are the pathwise ELBO gradient estimator properties
- Requires differentiable models
- Requires the variational approximation to be expressed as a deterministic transformation z = t(\varepsilon, v)
+ Generally lower variance
Amortised variational inference in hierarchical Bayesian models
F(v) = E_q[log(p(x, \beta, z_{1:N})) - log(q(\beta, z_{1:N} | \lambda, \phi_{1:N}))],
where v = {\lambda, \phi_{1:N}}
F(v) = E_q[log(p(x, \beta, z_{1:N}))]
- E_q[log(q(\beta | \lambda)) + \sum_n log(q(z_n | f(x_n, \theta)))],
where \phi_n = f(x_n, \theta) and f is a deep neural network (the inference network)
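A minimal sketch of such an inference network f(x_n, \theta) (the one-hidden-layer architecture and parameter names are illustrative, not from the card):

```python
import numpy as np

rng = np.random.default_rng(0)

# theta: weights of a small MLP that maps a scalar data point x_n to
# local variational parameters phi_n = (mean, std) of q(z_n | phi_n).
theta = {
    "W1": rng.normal(scale=0.1, size=(16, 1)), "b1": np.zeros(16),
    "W2": rng.normal(scale=0.1, size=(2, 16)), "b2": np.zeros(2),
}

def inference_net(x_n, theta):
    h = np.tanh(theta["W1"] @ np.atleast_1d(x_n) + theta["b1"])  # hidden layer
    out = theta["W2"] @ h + theta["b2"]                          # (raw mean, raw log-std)
    return out[0], np.exp(out[1])                                # phi_n = (mean, std)

print(inference_net(1.3, theta))
```

Because \phi_n is computed from x_n rather than stored per data point, the number of variational parameters does not grow with N.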
Amortised SVI (Algorithm)
- Input: data x, model p(\beta, z, x)
- Initialise global variational parameters \lambda randomly
- Repeat:
3.1 Sample \beta ~ q(\beta | \lambda)
3.2 Sample data point x_n uniformly at random
3.3 Compute stochastic natural gradients
»> \tilde\nabla_\lambda ELBO
»> \tilde\nabla_\theta ELBO
3.4 Update global parameters
»> \lambda += \rho_t \tilde\nabla_\lambda ELBO [global variational parameters]
»> \theta += \rho_t \tilde\nabla_\theta ELBO [inference network parameters]
BBVI (Algorithm) [black box variational inference]
- Input: model p(x, z), variational approximation q(z|v)
- Repeat:
2.1 Draw S samples z^{(s)} ~ q(z|v), s = 1, ..., S
2.2 Update variational parameters [MC estimate of the score-function gradient of the ELBO]
»> v += \rho_t * 1/S \sum_s \nabla_v log(q(z^{(s)} | v)) * ( log(p(x, z^{(s)})) - log(q(z^{(s)} | v)) )
2.3 t += 1
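A minimal end-to-end sketch of the BBVI loop above for the toy model used earlier (hyperparameters, the decaying step size, and the log-std parametrisation are illustrative choices, not part of the card):

```python
import numpy as np
from scipy.stats import norm

def bbvi(x, n_iters=3000, S=50, rng=np.random.default_rng(0)):
    """BBVI for p(z) = N(0,1), p(x|z) = N(z,1) with q(z|v) = N(m, exp(log_s)^2),
    updating v = (m, log_s) with the MC score-function gradient of the ELBO."""
    m, log_s = 0.0, 0.0
    for t in range(n_iters):
        rho_t = 0.1 / (1.0 + t) ** 0.6                          # decaying step size
        s = np.exp(log_s)
        z = rng.normal(m, s, size=S)                            # 2.1 draw S samples z^(s) ~ q(z|v)
        f = norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1) - norm.logpdf(z, m, s)
        grad_m = np.mean((z - m) / s**2 * f)                    # score w.r.t. m times (log p - log q)
        grad_log_s = np.mean(((z - m) ** 2 / s**2 - 1.0) * f)   # score w.r.t. log_s times (log p - log q)
        m += rho_t * grad_m                                     # 2.2 update variational parameters
        log_s += rho_t * grad_log_s                             # 2.3 step counter is the loop variable t
    return m, np.exp(log_s)

# The exact posterior for this model is N(x/2, 1/2), so for x = 1 the
# result should land near mean 0.5 and std ~0.71.
print(bbvi(x=1.0))
```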
SVI (Algorithm) [stochastic variational inference]
- Input: data x, model p(\beta, z, x)
- Initialise global variational parameters \lambda randomly
- Repeat:
3.1 Sample data point x_n uniformly at random
3.2 Update local parameter \phi_n
3.3 Compute intermediate global parameter \hat\lambda based on the noisy natural gradient
3.4 Set global parameter
»> \lambda = (1-\rho_t) * \lambda + \rho_t * \hat\lambda
Mean-Field Approximation (Algorithm) [CW3]
- Input: data x, model p(\beta, z, x)
- Initialise global variational parameters \lambda randomly
- While ELBO has not converged, repeat:
- 1 For each data point x_n
- 1.1 Update local variational parameters \phi_n
- 2 Update global variational parameters \lambda