VQ-VAE Flashcards

1
Q

VQ-VAE - What are the three main contributions of VQ-VAE in comparison to the VAE?

A

1. Quantisation of the latent space.
2. Restriction of the latent space to a set of vectors (the codebook).
3. The prior over the codebook is learned rather than kept static.

2
Q

VQ-VAE - How does it change the AE framework?

A

Quantisation of the latent space, restricting it to a learned set of codewords (the codebook).

3
Q

VQ-VAE - How does it change the sampling procedure of the VAE?

A

It assumes the prior is a uniform distribution over the codebook, and that the posterior over the decoder input is a deterministic (delta) distribution that places all its mass on the nearest codeword.

4
Q

VQ-VAE - What happens to the KL divergence?

A

It becomes a constant and is therefore dropped from the training objective.

5
Q

VQ-VAE - What happens to the prior?

A

It is learned during training rather than fixed in advance.

6
Q

VQ-VAE - What is the loss?

A

The sum of the reconstruction loss, the codebook loss, and the commitment loss.

7
Q

VQ-VAE - What is the codebook loss?

A

||sg[z_e(x)] − e_k||², where sg[·] is the stop-gradient operator and e_k is the selected codeword.

8
Q

VQ-VAE - What is the commitment loss?

A

β||z_e(x) − sg[e_k]||², where β weights how strongly the encoder output is pulled towards the chosen codeword.
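
As a numeric illustration (not the paper's code), both loss terms can be evaluated for a single encoder output and its chosen codeword. sg[·] only changes which parameters receive gradients, so it is omitted in this forward-only sketch; the vectors and β = 0.25 are made-up values.

```python
def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def vq_losses(z_e, e_k, beta=0.25):
    # Codebook loss ||sg[z_e(x)] - e_k||^2: gradients would update only e_k.
    codebook_loss = sq_l2(z_e, e_k)
    # Commitment loss beta * ||z_e(x) - sg[e_k]||^2: gradients would update
    # only the encoder. Numerically both terms share the same distance.
    commitment_loss = beta * sq_l2(z_e, e_k)
    return codebook_loss, commitment_loss

cb, cm = vq_losses((0.9, 1.2), (1.0, 1.0))
print(cb, cm)  # approximately 0.05 and 0.0125
```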

9
Q

VQ-VAE - Why is VQ-VAE more efficient than pixel-space autoregressive models when generating images?

A

Because it samples an autoregressive model only in the compressed latent space.

10
Q

VQ-VAE - What is an autoregressive model?

A

An autoregressive model generates a sequence one element at a time, with each output conditioned on the previously generated outputs.
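
A toy, non-neural illustration of the idea (entirely hypothetical): each new element of the sequence is a function of the outputs generated so far.

```python
def generate(steps, start=1):
    """Autoregressively build a sequence: each element depends on the last."""
    seq = [start]
    for _ in range(steps - 1):
        seq.append(seq[-1] * 2 + 1)  # next output conditioned on previous output
    return seq

print(generate(5))  # [1, 3, 7, 15, 31]
```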

11
Q

VQ-VAE-2 - How do they contribute in comparison to VQ-VAE?

A

They demonstrate that a multi-scale hierarchical organisation of VQ-VAE, augmented with powerful priors over the latent codes, can generate samples whose quality rivals that of GANs.

12
Q

VQ-VAE - What is the method that the VQ-VAE paper uses to learn how to generate new samples?

A

After training the VQ-VAE on the dataset, so that the codebook, encoder, and decoder had stabilised, they trained an autoregressive model (PixelCNN) to learn the distribution over the latent codes.

13
Q

VQ-VAE - how would the autoregressive model train on the latent codes?

A

The system picks a data point and uses the encoder to produce its latent codes. The autoregressive model is then trained with teacher forcing: at each position it sees only the preceding codes, and the ground truth is the next correct codeword.
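
A sketch of how such training pairs could be constructed from a flattened grid of latent code indices (the function name and the codes are illustrative, not from the paper):

```python
def next_code_pairs(codes):
    """For each position, pair the codes seen so far with the ground-truth
    next codeword, as teacher-forced autoregressive training would use."""
    return [(codes[:i], codes[i]) for i in range(len(codes))]

latent_codes = [3, 1, 4, 1]  # made-up indices into a codebook
for context, target in next_code_pairs(latent_codes):
    print(context, "->", target)  # e.g. [3, 1] -> 4
```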

14
Q

VQ-VAE - What are the steps to generate a new unseen sample (including training)?

A

1 - Finish training the VQ-VAE on the dataset
2 - Train an autoregressive model on the latent codes
3 - Randomly sample the first latent code
4 - Use the autoregressive model to predict the remaining codes in order
5 - Use the decoder to generate an unseen sample from the completed code sequence
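
The steps above can be sketched minimally with stand-in components: `predict_next` is a placeholder for the trained autoregressive prior and `decode` for the trained decoder; both are hypothetical stubs, not the paper's networks.

```python
import random

def predict_next(codes, num_codes=8):
    # placeholder for the trained autoregressive model's prediction
    return (sum(codes) + 1) % num_codes

def sample_latents(length, num_codes=8, seed=0):
    rng = random.Random(seed)
    codes = [rng.randrange(num_codes)]                # step 3: random first code
    while len(codes) < length:
        codes.append(predict_next(codes, num_codes))  # step 4: predict in order
    return codes

def decode(codes):
    # placeholder for step 5: a real decoder maps codes back to pixels
    return "image-from-" + "-".join(map(str, codes))

print(decode(sample_latents(6)))
```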

15
Q

VQ-VAE 2 - what was the motivation to separate the latent space into top and bottom levels?

A

Local detail such as texture is encoded at the bottom level, while global structure such as shape and geometry is encoded at the top level.

16
Q

How does the encoder in VQ-VAE map observations onto a discrete latent space?

A

The encoder is a non-linear mapping from the input space, x, to a vector E(x). This vector is then quantised based on its distance to the prototype vectors in the codebook e_k, k ∈ {1, …, K}, such that each vector E(x) is replaced by the index of the nearest prototype vector in the codebook.
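
A minimal sketch of this quantisation step, assuming a tiny made-up codebook and encoder output:

```python
def quantize(z_e, codebook):
    """Return the index k of the codeword nearest to z_e in squared L2."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(z_e, codebook[k]))

codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]  # K = 3 prototype vectors e_k
z_e = (0.9, 1.2)                                  # hypothetical encoder output E(x)
k = quantize(z_e, codebook)
print(k, codebook[k])  # -> 1 (1.0, 1.0): E(x) is replaced by index 1
```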

17
Q

What is the purpose of the codebook loss in VQ-VAE?

A

The codebook loss, which applies only to the codebook variables, brings the selected codebook vector e_k close to the output of the encoder, E(x).

18
Q

What role does the commitment loss play in the training of a VQ-VAE?

A

The commitment loss, which only applies to the encoder weights, encourages the output of the encoder to stay close to the chosen codebook vector to prevent it from fluctuating too frequently from one code vector to another.

19
Q

How does the hierarchical structure of VQ-VAE 2 improve latent representation?

A

The main motivation behind this is to model local information, such as texture, separately from global information such as shape and geometry of objects. The prior model over each level can thus be tailored to capture the specific correlations that exist in that level.

20
Q

Why does the bottom latent code in VQ-VAE 2 depend on the top latent code?

A

If the bottom latent were not conditioned on the top latent, the top latent would need to encode every detail from the pixels.

21
Q

How does VQ-VAE 2’s decoder reconstruct images from latent variables?

A

The decoder is similarly a feed-forward network that takes as input all levels of the quantized latent hierarchy. It consists of a few residual blocks followed by a number of strided transposed convolutions to upsample the representations back to the original image size.

22
Q

What is the primary role of the top-level prior model in VQ-VAE 2? How does it affect the NN architecture chosen for prior learning?

A

The prior over the top latent map is responsible for structural global information. Thus, we equip it with multi-headed self-attention layers so it can benefit from a larger receptive field to capture correlations in spatial locations that are far apart in the image.

23
Q

VQ-VAE 2 - Why does the bottom-level prior model not use attention layers?

A

For this prior over local information, we thus find that using large conditioning stacks (coming from the top prior) yields good performance. Using self-attention layers as in the top-level prior would not be practical due to memory constraints.

24
Q

Why is fitting prior distributions using neural networks from training data beneficial for latent variable models?

A

Fitting prior distributions using neural networks from training data has become common practice, as it can significantly improve the performance of latent variable models [5]. This procedure also reduces the gap between the marginal posterior and the prior. Thus, latent variables sampled from the learned prior at test time are close to what the decoder network has observed during training, which results in more coherent outputs.

25
Q

VQ-VAE 2 - How does the proximity of generated samples to the true data manifold affect classification by a pre-trained classifier?

A

The closer our samples are to the true data manifold, the more likely they are classified to the correct class labels by a pre-trained classifier.