Generative Image Dynamics - 2024 Flashcards
What is the goal of the generative images dynamics 2024 paper?
Given a single picture I0, the goal is to generate a video {Î1, Î2, …, ÎT} featuring oscillatory motions such as those of trees, flowers, or candle flames swaying in the breeze.
What are the 2 systems in the generative images dynamics 2024 paper?
1 A motion prediction module
2 An image-based rendering module
Gen image - What are the pipeline steps for motion prediction?
1 - A latent diffusion model (LDM) predicts a spectral volume S = (S_f0, S_f1, …, S_f(K−1)) for the input image I0.
2 - Transformation to a motion texture F = (F1, F2, …, FT ) through an inverse discrete Fourier transform.
Gen image - How do they formulate the motion prediction problem?
As a mapping from an input image to an output motion spectral volume, predicted with a conditional latent diffusion model.
Gen image - How are the motion trajectory of a pixel at future time steps and its representation as a spectral volume related?
S(p) = FFT(F(p))
Gen image - What are the most representative frequencies for real-time animation?
The first K = 16 Fourier coefficients are sufficient to realistically reproduce the original natural motion.
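A minimal numpy sketch of this representation (the array shapes and the toy trajectory are assumptions, not the paper's data): a per-pixel motion trajectory F(p) is transformed with an FFT and only the first K = 16 frequency terms are kept.

```python
import numpy as np

T, K = 150, 16                                   # video length, kept frequency terms
t = np.arange(T)
traj = np.stack([np.sin(2 * np.pi * t / 50),     # toy oscillatory (dx, dy) trajectory of one pixel
                 0.5 * np.cos(2 * np.pi * t / 30)], axis=1)

spec = np.fft.rfft(traj, axis=0)                 # S(p) = FFT(F(p)), one complex coefficient per frequency
spec[K:] = 0                                     # keep only the K lowest frequencies
recon = np.fft.irfft(spec, n=T, axis=0)          # ≈ the original motion, reconstructed from K terms
```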
Gen image - What are the 2 main parts of the LDM?
1 VAE
2 U-net based diffusion
Gen image - what is the problem with vanilla normalisation of the spectrum coefficients?
Nearly all the coefficients at higher frequencies end up close to zero. Models trained on such data can produce inaccurate motions, since during inference even small prediction errors can cause large relative errors after denormalization.
Gen image - how do they normalise the coefficients?
First, we independently normalize the Fourier coefficients at each frequency based on statistics computed from the training set. We then apply a power transformation to each scaled Fourier coefficient to pull it away from extreme values.
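A minimal sketch of this frequency-adaptive normalization (the per-frequency statistics and the power-transform exponent here are assumptions, not the paper's released values):

```python
import numpy as np

def normalize_spectral_volume(S, freq_scale, p=0.5):
    """S: (K, H, W, C) Fourier coefficients; freq_scale: (K,) per-frequency statistics
    computed from the training set (e.g. a high percentile of coefficient magnitude)."""
    S_scaled = S / freq_scale[:, None, None, None]        # step 1: per-frequency scaling
    return np.sign(S_scaled) * np.abs(S_scaled) ** p      # step 2: power transform away from extreme values

def denormalize_spectral_volume(S_norm, freq_scale, p=0.5):
    S_scaled = np.sign(S_norm) * np.abs(S_norm) ** (1.0 / p)
    return S_scaled * freq_scale[:, None, None, None]
```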
Gen image - What is the problem with directly predicting the motion for every frame (a full motion texture) with a neural network?
The size of the motion texture would need to scale with the length of the video: a longer video requires predicting more and more frames of motion.
Gen image - how do the authors overcome the problem of predicting long videos?
They predict harmonic spectral coefficients, which compactly describe the motion of oscillatory videos with a fixed number of frequency terms, independent of video length.
Gen image - describe the motion prediction module training phase - LDM.
A VAE is trained to encode spectral volumes into latent vectors and to decode them back into spectral volumes.
The U-Net is trained to denoise latents that were noised with a predefined diffusion schedule, conditioned on embeddings of the input image I0.
Gen image - What is the training loss for the LDM training?
L_LDM = E_{n∼U[1,N], ε_n∼N(0,1)} [ ‖ε_n − ε_θ(z_n; n, c)‖² ], where z_n is the latent diffused to noise level n and c is the input-image conditioning.
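A minimal PyTorch-style sketch of this ε-prediction objective (the schedule, tensor shapes, and module interfaces are placeholders, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def ldm_loss(unet, z0, cond, alphas_cumprod):
    """z0: clean VAE latents of the spectral volume; cond: input-image embedding c."""
    n = torch.randint(0, len(alphas_cumprod), (z0.shape[0],), device=z0.device)  # n ~ U[1, N]
    eps = torch.randn_like(z0)                                                   # ε_n ~ N(0, I)
    a = alphas_cumprod[n].view(-1, 1, 1, 1)
    z_n = a.sqrt() * z0 + (1 - a).sqrt() * eps       # diffuse z0 to noise level n
    eps_pred = unet(z_n, n, cond)                    # ε_θ(z_n; n, c)
    return F.mse_loss(eps_pred, eps)                 # ‖ε_n − ε_θ‖²
```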
Gen image - What is their training input and how do they produce it?
Their training input is an image and its spectral volume describing its motion.
Starting from a video they take the first frame as the image.
They estimate per-pixel motion trajectories from the video and apply an FFT to each trajectory to produce the spectral volume.
Gen image - What is the architecture for their image-based rendering module?
1 Feature extractor from the input image
2 Motion at time t
3 Per-pixel motion weights
4 Softmax splatting, applied separately at each scale, forward-warps the features (1) using the motion (2) and the per-pixel weights (3)
5 A synthesis network takes the warped features from each scale and produces the image at time t
Gen image - how do they get the motion at time t?
By applying an inverse FFT to the predicted spectral volume to recover the motion texture, then taking the frame at index t.
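A minimal numpy sketch of this step (the array layout is an assumption): zero-pad the K predicted coefficients back to the full spectrum, apply an inverse FFT over the frequency axis, and index frame t.

```python
import numpy as np

def motion_at_time(spectral, T, t):
    """spectral: complex (K, H, W, 2) x/y Fourier coefficients; returns an (H, W, 2) displacement field."""
    full = np.zeros((T // 2 + 1,) + spectral.shape[1:], dtype=complex)
    full[:spectral.shape[0]] = spectral          # zero-pad the missing high frequencies
    F = np.fft.irfft(full, n=T, axis=0)          # motion texture F_1 ... F_T, shape (T, H, W, 2)
    return F[t]                                  # displacement field at time t
```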
Gen image - how do they create the per-pixel motion weights?
W(p) = (1/T) Σ_t ‖F_t(p)‖₂
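A minimal sketch of that formula, assuming the motion texture F is stored as a (T, H, W, 2) array:

```python
import numpy as np

def motion_weights(F):
    """W(p) = (1/T) Σ_t ‖F_t(p)‖₂ : average flow magnitude per pixel, shape (H, W)."""
    return np.linalg.norm(F, axis=-1).mean(axis=0)
```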
Gen image - what is the feature pyramid softmax splatting strategy?
Splatting is performed separately at each scale of the feature pyramid.
Softmax weighting resolves conflicts where multiple source pixels land on the same destination, smoothing the result.
A synthesis network combines the warped features from all scales to reconstruct the output frame.
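A minimal PyTorch-style sketch of this multi-scale strategy (encoder, decoder, and softmax_splat are assumed callables standing in for the paper's feature extractor, synthesis network, and softmax splatting operator):

```python
import torch
import torch.nn.functional as F

def render_frame(encoder, decoder, softmax_splat, I0, flow_t, W):
    """I0: (B,3,H,W) input image; flow_t: (B,2,H,W) motion at time t; W: (B,1,H,W) per-pixel weights."""
    warped = []
    for feat in encoder(I0):                              # feature pyramid, one tensor per scale
        h, w = feat.shape[-2:]
        scale = h / flow_t.shape[-2]
        flow_l = F.interpolate(flow_t, size=(h, w), mode='bilinear', align_corners=False) * scale
        w_l = F.interpolate(W, size=(h, w), mode='bilinear', align_corners=False)
        warped.append(softmax_splat(feat, flow_l, w_l))   # forward-warp this scale with softmax splatting
    return decoder(warped)                                # synthesis network -> frame at time t
```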