Latent Video Diffusion Models for High-Fidelity Long Video Generation

23 Nov 2022 · Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen

AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation issue for long video generation, we propose conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length. Extensive experiments on small domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.
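
The two error-mitigation ideas named in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption: `perturb_condition` uses a DDPM-style forward noising step as a stand-in for conditional latent perturbation, `guided_eps` uses the standard classifier-free guidance blend as a stand-in for unconditional guidance, and the names, the noise level `alpha_bar`, the weight `w`, and the latent shape are all hypothetical.

```python
import numpy as np

def perturb_condition(z_cond, alpha_bar, rng):
    # Assumed DDPM-style noising of the conditioning latents, so the model
    # sees slightly corrupted conditions similar to its own imperfect
    # outputs. alpha_bar = 1.0 leaves the latents unchanged.
    noise = rng.standard_normal(z_cond.shape)
    return np.sqrt(alpha_bar) * z_cond + np.sqrt(1.0 - alpha_bar) * noise

def guided_eps(eps_cond, eps_uncond, w):
    # Assumed classifier-free guidance blend of the conditional and
    # unconditional noise predictions. w = 0 is purely conditional.
    return (1.0 + w) * eps_cond - w * eps_uncond

rng = np.random.default_rng(0)
# Hypothetical latent clip laid out as (channels, frames, height, width).
z_prev = rng.standard_normal((4, 8, 32, 32))
z_noisy = perturb_condition(z_prev, alpha_bar=0.95, rng=rng)

# Placeholder predictions standing in for two denoiser forward passes.
eps_c = np.ones_like(z_prev)
eps_u = np.zeros_like(z_prev)
eps = guided_eps(eps_c, eps_u, w=0.5)
```

In an autoregressive extension loop, `z_noisy` would replace the clean previous latents as the condition for the next chunk, and `eps` would replace the plain conditional prediction inside each sampling step.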

Results from the Paper


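| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Generation | Sky Time-lapse | LVDM (256x256) | KVD16 | 3.9 | #4 |
| Video Generation | Sky Time-lapse | LVDM (256x256) | FVD 16 | 95.2 | #2 |
| Video Generation | Sky Time-lapse | Long-video GAN (128x128) | FVD 16 | 107.5 | #3 |
| Video Generation | Sky Time-lapse | DIGAN (128x128) | KVD16 | 6.8 | #2 |
| Video Generation | Sky Time-lapse | DIGAN (128x128) | FVD 16 | 114.6 | #4 |
| Video Generation | Sky Time-lapse | Long-video GAN (256x256) | FVD 16 | 116.5 | #5 |
| Video Generation | Sky Time-lapse | TATS (128x128) | KVD16 | 5.7 | #3 |
| Video Generation | Sky Time-lapse | TATS (128x128) | FVD 16 | 132.6 | #6 |
| Video Generation | Sky Time-lapse | MoCoGAN-HD (128x128) | KVD16 | 13.9 | #1 |
| Video Generation | Sky Time-lapse | MoCoGAN-HD (128x128) | FVD 16 | 183.6 | #7 |
| Video Generation | Taichi | DIGAN (256x256) | FVD16 | 156.7 | #6 |
| Video Generation | Taichi | LVDM (256x256) | FVD16 | 99 | #3 |
| Video Generation | Taichi | LVDM (256x256) | KVD16 | 15.3 | #3 |
| Video Generation | Taichi | TATS (128x128) | FVD16 | 94.6 | #2 |
| Video Generation | Taichi | TATS (128x128) | KVD16 | 9.8 | #4 |
| Video Generation | Taichi | DIGAN (128x128) | FVD16 | 128.1 | #4 |
| Video Generation | Taichi | DIGAN (128x128) | KVD16 | 20.6 | #2 |
| Video Generation | Taichi | MoCoGAN-HD (128x128) | FVD16 | 144.7 | #5 |
| Video Generation | Taichi | MoCoGAN-HD (128x128) | KVD16 | 25.4 | #1 |
| Video Generation | UCF-101 | MCVD | FVD16 | 2460 | #36 |
| Video Generation | UCF-101 | MCVD | KVD16 | 148 | #6 |
| Video Generation | UCF-101 | VDM | FVD16 | 1396 | #35 |
| Video Generation | UCF-101 | VDM | KVD16 | 116 | #5 |
| Video Generation | UCF-101 | TGAN-v2 (128x128) | FVD16 | 1209 | #34 |
| Video Generation | UCF-101 | LVDM (256x256, unconditional) | FVD16 | 552 | #28 |
| Video Generation | UCF-101 | LVDM (256x256, unconditional) | KVD16 | 42 | #3 |
| Video Generation | UCF-101 | (not labeled) | FVD16 | 372 | #21 |
| Video Generation | UCF-101 | (not labeled) | KVD16 | 27 | #1 |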
