TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Text-to-Video Generation	MSR-VTT	MagicVideo	FID	36.5	# 8
Text-to-Video Generation	MSR-VTT	MagicVideo	FVD	998	# 10
Text-to-Video Generation	UCF-101	MagicVideo (Zero-shot, 256x256)	FVD16	699	# 11
Video Generation	UCF-101	MagicVideo (256x256, text-conditional)	FVD16	699	# 31

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/magicvideo-efficient-video-generation-with/text-to-video-generation-on-msr-vtt)](https://paperswithcode.com/sota/text-to-video-generation-on-msr-vtt?p=magicvideo-efficient-video-generation-with)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/magicvideo-efficient-video-generation-with/text-to-video-generation-on-ucf-101)](https://paperswithcode.com/sota/text-to-video-generation-on-ucf-101?p=magicvideo-efficient-video-generation-with)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/magicvideo-efficient-video-generation-with/video-generation-on-ucf-101)](https://paperswithcode.com/sota/video-generation-on-ucf-101?p=magicvideo-efficient-video-generation-with)`

MagicVideo: Efficient Video Generation With Latent Diffusion Models

20 Nov 2022 · Daquan Zhou, Weimin WANG, Hanshu Yan, Weiwei Lv, Yizhe Zhu, Jiashi Feng

We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo. MagicVideo can generate smooth video clips that are consistent with the given text descriptions. Thanks to a novel and efficient 3D U-Net design and to modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips at 256x256 spatial resolution on a single GPU card, requiring around 64x fewer FLOPs than Video Diffusion Models (VDM). Specifically, unlike existing works that train video models directly in RGB space, we use a pre-trained VAE to map video clips into a low-dimensional latent space and learn the distribution of the videos' latent codes with a diffusion model. In addition, we introduce two new designs to adapt a U-Net denoiser trained on image tasks to video data: a lightweight frame-wise adaptor for image-to-video distribution adjustment and a directed temporal attention module to capture temporal dependencies across frames. As a result, we can reuse the informative convolution weights of a text-to-image model to accelerate video training. To reduce pixel dithering in the generated videos, we also propose a novel VideoVAE auto-encoder for better RGB reconstruction. We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content. See https://magicvideo.github.io/# for more examples.
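The directed temporal attention mentioned in the abstract can be illustrated with a small sketch. It assumes that "directed" means each frame attends only to itself and to earlier frames (a causal mask along the time axis), which is an interpretation rather than something the abstract states. The function name `directed_temporal_attention` is hypothetical, and the learned query/key/value projections of a real attention layer are replaced by the identity to keep the example short. It is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def directed_temporal_attention(frames):
    """Causal attention over the time axis (illustrative sketch only).

    frames: list of T feature vectors, one per video frame, taken at a
    single spatial location of the latent feature map.
    Assumption: queries, keys and values are the raw features; a real
    layer would apply learned linear projections first.
    """
    dim = len(frames[0])
    outputs = []
    for t, query in enumerate(frames):
        # Frame t sees only frames 0..t, never a future frame.
        visible = frames[: t + 1]
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        # Weighted average of the visible frames' features.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, visible)) for i in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    clip = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 frames, 2 features each
    for t, feat in enumerate(directed_temporal_attention(clip)):
        print(t, feat)
```

Because of the mask, changing a later frame leaves the outputs of all earlier frames unchanged, which is the property that distinguishes this from ordinary bidirectional temporal attention.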

Code

No code implementations yet.

Tasks

Text-to-Video Generation

Video Generation

Datasets

UCF101

MSR-VTT

WebVid

DrawBench

Results from the Paper

Ranked #10 on Text-to-Video Generation on MSR-VTT

Methods

Concatenated Skip Connection • Convolution • Diffusion • Max Pooling • ReLU • Temporal attention • U-Net • VAE
