Generative Image Dynamics - 2024 Flashcards

1
Q

What is the goal of the Generative Image Dynamics 2024 paper?

A

Given a single picture I0, our goal is to generate a video {Î1, Î2, …, ÎT} featuring oscillatory motions such as those of trees, flowers, or candle flames swaying in the breeze.

2
Q

What are the 2 modules in the Generative Image Dynamics 2024 paper?

A

1 A motion prediction module
2 An image-based rendering module

3
Q

Gen image - What are the pipeline steps for motion prediction?

A

1 - A latent diffusion model (LDM) predicts a spectral volume S = (S_{f_0}, S_{f_1}, …, S_{f_{K−1}}) for the input image I0.
2 - The spectral volume is transformed into a motion texture F = (F_1, F_2, …, F_T) through an inverse discrete Fourier transform.
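The two steps above can be sketched in NumPy (a simplified illustration with assumed shapes, not the authors' code; step 1 is replaced by random coefficients standing in for the LDM's prediction):

```python
import numpy as np

# Step 1 stand-in: a "predicted" spectral volume of K complex coefficients
# per pixel (the real LDM would produce this from the input image I0).
K, T, H, W = 16, 150, 8, 8
rng = np.random.default_rng(0)
S = rng.standard_normal((K, H, W, 2)) + 1j * rng.standard_normal((K, H, W, 2))

# Step 2: zero-pad the K low-frequency bins up to T time steps, then apply
# an inverse discrete Fourier transform along time to obtain the motion
# texture F = (F_1, ..., F_T), one 2D displacement field per frame.
S_full = np.zeros((T, H, W, 2), dtype=complex)
S_full[:K] = S
F = np.fft.ifft(S_full, axis=0).real
```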

4
Q

Gen image - How do they formulate the motion prediction problem?

A

As a mapping from an input image to an output motion spectral volume (an image-to-image prediction problem).

5
Q

Gen image - How are the motion trajectory of a pixel at future time steps and its representation as a spectral volume related?

A

S(p) = FFT(F(p))

6
Q

Gen image - What are the most representative frequencies for real-time animation?

A

The first K = 16 Fourier coefficients are sufficient to realistically reproduce the original natural motion.
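A quick NumPy check of the idea on a toy 1-D trajectory (synthetic data, not the paper's): keeping only the first K = 16 coefficients of a low-frequency oscillation reconstructs it almost exactly.

```python
import numpy as np

T, K = 150, 16
t = np.arange(T)
# synthetic oscillatory trajectory whose energy sits below frequency K
traj = np.sin(2 * np.pi * 2 * t / T) + 0.5 * np.cos(2 * np.pi * 5 * t / T)

S = np.fft.fft(traj)
S_trunc = np.zeros_like(S)
S_trunc[:K] = S[:K]                  # first K coefficients
S_trunc[-(K - 1):] = S[-(K - 1):]    # conjugate partners (real signal)
recon = np.fft.ifft(S_trunc).real

err = np.max(np.abs(traj - recon))   # tiny: truncation loses nothing here
```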

7
Q

Gen image - What are the 2 main parts of the LDM?

A

1 A VAE
2 A U-Net-based diffusion model

8
Q

Gen image - what is the problem with vanilla normalisation of the spectrum coefficients?

A

Nearly all the coefficients at higher frequencies end up close to zero. Models trained on such data can produce inaccurate motions, since during inference even small prediction errors can cause large relative errors after denormalization.

9
Q

Gen image - how do they normalise the coefficients?

A

First, we independently normalize Fourier coefficients at each frequency based on statistics computed from the training set.
We then apply a power transformation to each scaled Fourier coefficient to pull it away from extreme values.
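A hedged sketch of that recipe (the percentile statistic and the exponent here are assumptions for illustration, not the paper's exact choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy training-set coefficients, shape (samples, K); higher frequencies
# have much smaller magnitudes, matching the paper's observation
coeffs = rng.standard_normal((1000, 16)) * np.geomspace(1.0, 0.01, 16)

# 1) normalize each frequency independently using training-set statistics
scale = np.percentile(np.abs(coeffs), 95, axis=0)   # assumed statistic
normed = coeffs / scale

# 2) signed power transform: for |x| < 1, |x|**0.5 > |x|, so small
# coefficients are pulled away from zero while signs are preserved
alpha = 0.5                                          # assumed exponent
transformed = np.sign(normed) * np.abs(normed) ** alpha
```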

10
Q

Gen image - What is the problem with directly generating the image at time t from the input image with a neural network?

A

The size of the motion texture would need to scale with the length of the video: a longer video requires predicting more and more frames.

11
Q

Gen image - how do the authors overcome the problem of predicting long videos?

A

They predict harmonic spectral coefficients, which compactly describe the motion of oscillatory videos with a fixed number of terms per pixel.

12
Q

Gen image - describe the motion prediction module training phase - LDM.

A

A VAE is trained to encode spectral volumes into latent vectors and to decode those latents back into spectral volumes.
The U-Net is trained to denoise noisy latents that were diffused with a pre-defined schedule, conditioned on embeddings of the input image.

13
Q

Gen image - What is the training loss for the LDM training?

A

$\mathcal{L}_{\mathrm{LDM}} = \mathbb{E}_{n \sim \mathcal{U}[1,N],\, \epsilon_n \sim \mathcal{N}(0,1)} \big[ \lVert \epsilon_n - \epsilon_\theta(z_n; n, c) \rVert^2 \big]$
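A minimal NumPy sketch of this objective (assumed shapes and noise schedule; `eps_theta` is a stand-in for the real conditional U-Net):

```python
import numpy as np

rng = np.random.default_rng(0)

def ldm_loss(eps_theta, z0, c, alphas_cumprod):
    # n ~ U[1, N]: sample a random diffusion step
    n = rng.integers(0, len(alphas_cumprod))
    # eps_n ~ N(0, 1): sample the noise that diffuses z0 to z_n
    eps = rng.standard_normal(z0.shape)
    a = alphas_cumprod[n]
    z_n = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps   # forward diffusion
    # || eps_n - eps_theta(z_n; n, c) ||^2
    return np.mean((eps - eps_theta(z_n, n, c)) ** 2)

# stand-in predictor; a zero prediction gives loss ~ E[eps^2] ~ 1
dummy = lambda z_n, n, c: np.zeros_like(z_n)
loss = ldm_loss(dummy, rng.standard_normal((4, 16)), None,
                np.linspace(0.99, 0.01, 100))
```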

14
Q

Gen image - What is their training input and how do they produce it?

A

Their training input is an image paired with a spectral volume describing its motion.
Starting from a video, they take the first frame as the image.
They produce the spectral volume by applying an FFT to the per-pixel motion trajectories extracted from the video.
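A sketch of that data preparation (assumed shapes; in practice the trajectories F would come from motion estimation on the video, here random data):

```python
import numpy as np

T, H, W, K = 150, 8, 8, 16
rng = np.random.default_rng(0)
# per-pixel 2D displacement trajectories extracted from a training video
F = rng.standard_normal((T, H, W, 2))

# spectral volume: FFT over time, keeping the first K frequency terms
S = np.fft.fft(F, axis=0)[:K]
```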

15
Q

Gen image - What is the architecture for their image-based rendering module?

A

1 A feature extractor applied to the input image (producing a feature pyramid)
2 The motion field at time t
3 Per-pixel motion weights
4 Softmax splatting applied at each scale separately, combining 1 + 2 + 3
5 A synthesis network that takes the warped features at each scale and produces the image at time t

16
Q

Gen image - how do they get the motion at time t?

A

Inverse FFT of the spectral volume, then take the frame at index t.

17
Q

Gen image - how do they create the per-pixel motion weights?

A

$W(p) = \frac{1}{T} \sum_t \lVert F_t(p) \rVert_2$
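In NumPy, with an assumed motion-texture layout of (T, H, W, 2):

```python
import numpy as np

T, H, W = 10, 4, 4
rng = np.random.default_rng(0)
F = rng.standard_normal((T, H, W, 2))   # 2D displacement field per frame

# W(p) = (1/T) * sum_t ||F_t(p)||_2 : average displacement magnitude
# per pixel, used to weight pixels during softmax splatting
weights = np.linalg.norm(F, axis=-1).mean(axis=0)
```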

18
Q

Gen image - what is the feature pyramid softmax splatting strategy?

A

Splatting is performed at each scale of the feature pyramid.
Contributions that land on the same target pixel are blended with softmax weights.
A synthesis network then combines the warped features from all scales to reconstruct the output frame.
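A much-simplified 1-D illustration of softmax splatting (an assumption-laden sketch, not the paper's implementation): each source pixel is forward-warped by its displacement, and pixels landing on the same target are blended with softmax weights exp(w), so pixels with larger motion weight dominate collisions.

```python
import numpy as np

def softmax_splat_1d(feat, disp, w):
    """Forward-warp features by disp; blend collisions with weights exp(w)."""
    num = np.zeros_like(feat)
    den = np.zeros_like(feat)
    for i, f in enumerate(feat):
        j = int(round(i + disp[i]))          # target location after warping
        if 0 <= j < len(feat):
            num[j] += np.exp(w[i]) * f       # weighted contribution
            den[j] += np.exp(w[i])
    return num / np.maximum(den, 1e-8)       # softmax normalization

feat = np.array([1.0, 2.0, 3.0, 4.0])
# zero motion and uniform weights reproduce the input features
out = softmax_splat_1d(feat, np.zeros(4), np.zeros(4))
```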