VAE Flashcards
How should one optimise the ELBO and what would happen to p(x) and D_KL?
To optimise the ELBO is to maximise it. Because the ELBO is a lower bound on log p(x), maximising it tends to push the log likelihood up as well.
The KL divergence would also tend to decrease: optimising θ and φ brings q_φ(z|x) closer to the true posterior p_θ(z|x), and that KL is exactly the gap between log p(x) and the ELBO.
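For reference, the standard VI identity behind this card (not spelled out in the original answer) is:

```latex
\log p_\theta(x) \;=\; \underbrace{\mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]}_{\mathrm{ELBO}(\theta,\phi)} \;+\; D_{KL}\!\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right)
```

Since log p_θ(x) does not depend on φ, any increase of the ELBO with respect to φ must come out of the KL term, which is why the gap shrinks.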
What is the loss term for p(x|z)?
The loss is the expectation over the noise ε (equivalently over q_φ(z|x)) of log p(x|z). The expectation is estimated by a Monte Carlo average over samples.
What happens to the loss term for p(x|z) when we sample a minibatch?
When the minibatch is drawn randomly, the inner expectation over z can be estimated with just one sample per data point (L = 1), which reduces computation while keeping the estimator unbiased.
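Written out (M is the minibatch size and g_φ is the reparametrised sampler; the notation is mine):

```latex
\mathbb{E}_{x}\,\mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x \mid z)\right]
\;\approx\; \frac{1}{M}\sum_{i=1}^{M} \log p_\theta\!\left(x^{(i)} \mid z^{(i)}\right),
\qquad z^{(i)} = g_\phi\!\left(\epsilon^{(i)}, x^{(i)}\right),\; \epsilon^{(i)} \sim \mathcal{N}(0, I)
```

That is, one ε per data point suffices as long as the minibatch is large enough.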
How would one optimise the ELBO and what problem arises?
To optimise we would take gradients with respect to the parameters and search for a maximum.
With respect to θ (the decoder parameters) there is no problem. With respect to φ (the encoder parameters) the expectation is taken over q_φ(z|x), which itself depends on φ, so the gradient involves an integral over z that is hard to compute, and the naive Monte Carlo estimator of it has very high variance.
What is the reparametrisation trick?
Rewriting z as a deterministic function of φ, x and standard-Gaussian noise ε ~ N(0, 1), so the randomness is moved outside the network. In practice the encoder parameters φ determine μ and σ; after sampling ε we get z = μ + σ·ε.
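A minimal PyTorch sketch of the trick (the function name and the toy shapes in the usage example are mine, not from the paper):

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives only in eps, so z is a deterministic, differentiable
    function of the encoder outputs mu and logvar (i.e. of phi).
    """
    std = torch.exp(0.5 * logvar)   # sigma, parametrised via log-variance for stability
    eps = torch.randn_like(std)     # external standard-Gaussian noise
    return mu + eps * std

# Usage: gradients flow from z back into mu and logvar.
mu = torch.zeros(4, 2, requires_grad=True)
logvar = torch.zeros(4, 2, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()
print(mu.grad.shape, logvar.grad.shape)
```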
What are the 4 main principles that VAE is based on?
1 An autoencoder (AE) framework
2 A Gaussian sampling procedure in the latent space
3 A standard Gaussian N(0, 1) prior over the latent vector
4 A reconstruction loss plus a KL divergence loss (sketched below)
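A minimal sketch of how the two loss terms in principle 4 are combined, assuming a Bernoulli decoder and a diagonal-Gaussian encoder (these likelihood choices are assumptions for illustration, not the only option):

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_logits, x, mu, logvar):
    """Negative ELBO = reconstruction loss + KL(q(z|x) || N(0, I))."""
    # Reconstruction term: -log p(x|z) for a Bernoulli decoder over pixels.
    recon = F.binary_cross_entropy_with_logits(recon_logits, x, reduction="sum")
    # KL term in closed form for a diagonal Gaussian against the N(0, I) prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```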
Why, by design, does a VAE either overfit or smooth the input?
Because the two loss components trade off against each other: the reconstruction term pushes towards overfitting the training examples, while the D_KL term pushes the latent vectors into the same region of latent space, so decodings behave like a blend (roughly a linear combination) of nearby inputs.
AE - Why is an AE not designed to generate new samples from the latent space?
Because there is no regularisation on the latent space, so decoding arbitrary points in it produces unpredictable results.
What change over AE makes VAE generate meaningful new samples from the latent space?
VAE assumes that x was generated by a random process involving a latent variable z. It uses the variational inference (VI) framework to express log p(x) in terms of the ELBO and D_KL. Optimising the ELBO (and with it the KL term) regularises the latent space, so samples of z decode to meaningful x.
What is the assumption about data that can be represented by latent vectors?
That the underlying structure of the data can be captured with less information (fewer dimensions) than the observed representation.
What is a VAE designed to do concerning data generation?
It is designed to generate new data points that look like the training data.
What is the definition of ‘Variational Inference’?
Approximating an intractable distribution (here, the posterior p(z|x)) with a simpler, tractable distribution, turning inference into an optimisation problem.
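In symbols (a standard formulation, not quoted from the card):

```latex
q^{*}(z) \;=\; \arg\min_{q \in \mathcal{Q}} \; D_{KL}\!\left(q(z)\,\|\,p(z \mid x)\right)
```

where Q is a tractable family of distributions, e.g. diagonal Gaussians.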
What is ‘Amortized VI’?
Sharing the parameters of the approximating distribution across all data points (via a single encoder network), rather than optimising separate variational parameters for each data point.
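A minimal sketch of what amortisation looks like in code: one network with weights φ produces the variational parameters for any x, instead of separate (μ_i, σ_i) fitted per data point (the layer sizes are arbitrary assumptions):

```python
import torch
import torch.nn as nn

class AmortizedEncoder(nn.Module):
    """One set of weights (phi) maps any x to its variational parameters."""
    def __init__(self, x_dim: int = 784, h_dim: int = 400, z_dim: int = 20):
        super().__init__()
        self.hidden = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)        # mu(x; phi)
        self.logvar = nn.Linear(h_dim, z_dim)    # log sigma^2(x; phi)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.mu(h), self.logvar(h)
```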
Why was KL divergence chosen in the paper?
Because the ELBO can be rewritten as E[log p(x|z)] − D_KL(q_φ(z|x) ∥ p(z)). This form was chosen because, for a Gaussian posterior and a standard Gaussian prior, the KL term has an analytic solution that can be written down directly in terms of μ and σ.
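The closed-form expression for a diagonal Gaussian posterior against the N(0, 1) prior (J is the latent dimension):

```latex
D_{KL}\!\left(\mathcal{N}(\mu, \operatorname{diag}(\sigma^{2})) \,\big\|\, \mathcal{N}(0, I)\right)
\;=\; -\tfrac{1}{2}\sum_{j=1}^{J}\left(1 + \log \sigma_j^{2} - \mu_j^{2} - \sigma_j^{2}\right)
```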
If the loss was weighted with a high emphasis on reconstruction, what would happen?
We would revert to a sort of AE that reconstructs the examples very well, but sampling from regions of latent space between the encoded groups would make no sense.
If the loss was weighted with a high emphasis on D_KL, what would happen?
The approximate posterior would be pushed very close to N(0, 1), and we would get blurred reconstructions of the examples regardless of which sample we choose.
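One way to make the trade-off in the last two cards explicit is to put a weight on the KL term (a β-VAE-style weighting; β = 1 recovers the standard VAE objective, and the weighting itself is not part of the original paper):

```latex
\mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{q_\phi(z|x)}\!\left[-\log p_\theta(x \mid z)\right] \;+\; \beta \, D_{KL}\!\left(q_\phi(z|x)\,\|\,p(z)\right)
```

β ≪ 1 gives the AE-like regime from the previous card; β ≫ 1 gives the blurred, prior-dominated regime described here.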
What is the purpose of the reparameterization trick in VAEs?
To make the loss (which lower-bounds log p(x)) differentiable with respect to the encoder parameters φ, so that it can be estimated and optimised with gradient-based methods.
What assumption does VAE make about data generation?
We assume that the data are generated by some random process, involving an unobserved continuous random variable z.
Give the initial 2 abstract steps of the data generation process
(1) a value z^(i) is generated from some prior distribution p_θ*(z); (2) a value x^(i) is generated from some conditional distribution p_θ*(x|z).
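These two steps are ancestral sampling; a minimal sketch with an untrained stand-in decoder (the architecture and the Bernoulli likelihood are assumptions for illustration):

```python
import torch
import torch.nn as nn

z_dim, x_dim = 20, 784
# Stand-in for a trained decoder p_theta*(x|z); untrained here, for shapes only.
decoder = nn.Sequential(nn.Linear(z_dim, 400), nn.ReLU(),
                        nn.Linear(400, x_dim), nn.Sigmoid())

z = torch.randn(16, z_dim)      # step (1): z^(i) ~ p_theta*(z) = N(0, I)
x_probs = decoder(z)            # step (2): parameters of p_theta*(x|z)
x = torch.bernoulli(x_probs)    # sample x^(i) from the Bernoulli conditional
print(x.shape)                  # torch.Size([16, 784])
```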