Make-A-Video: Text-to-Video Generation without Text-Video Data

We propose Make-A-Video -- an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetics, fantastical depictions, etc.) of today's image generation models. We design a simple yet effective way to build on T2I models with novel spatial-temporal modules. First, we decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, we design a spatial-temporal pipeline to generate high-resolution, high-frame-rate videos with a video decoder, an interpolation model, and two super-resolution models, which also enable applications beyond T2V. In all aspects -- spatial and temporal resolution, faithfulness to text, and quality -- Make-A-Video sets the new state of the art in text-to-video generation, as determined by both qualitative and quantitative measures.
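To make the first idea concrete, here is a minimal sketch of the kind of factorized ("pseudo-3D") convolution the abstract alludes to: a full spatiotemporal convolution is approximated by a 2D convolution over space followed by a 1D convolution over time. This assumes PyTorch; the class name and layer layout are illustrative, not the paper's released code. The paper does note that the spatial layers come from the pretrained T2I model while the new temporal layers are initialized so the network initially behaves like the image model.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Factorized space-time convolution: 2D spatial conv + 1D temporal conv.

    Illustrative sketch only; not the paper's actual implementation.
    """

    def __init__(self, channels: int, spatial_kernel: int = 3, temporal_kernel: int = 3):
        super().__init__()
        # Spatial conv: in the paper this weight comes from a pretrained T2I model.
        self.spatial = nn.Conv2d(channels, channels, spatial_kernel,
                                 padding=spatial_kernel // 2)
        # Temporal conv: newly added; the paper initializes it as the identity
        # so training starts from the image model's behavior.
        self.temporal = nn.Conv1d(channels, channels, temporal_kernel,
                                  padding=temporal_kernel // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        b, c, t, h, w = x.shape
        # Apply the 2D conv to every frame independently.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.spatial(x)
        # Apply the 1D conv across time at every spatial location.
        x = x.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        x = self.temporal(x)
        # Restore (batch, channels, time, height, width).
        return x.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)

# Example: 2 clips, 64 channels, 8 frames, 32x32 spatial resolution.
x = torch.randn(2, 64, 8, 32, 32)
y = Pseudo3DConv(64)(x)  # output shape: (2, 64, 8, 32, 32)
```

The same factorization is applied to attention in the paper: spatial self-attention per frame, followed by attention across the time axis, which is far cheaper than full 3D attention.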
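The second idea, the cascaded pipeline, can be summarized as a chain of stages: decode a short low-resolution clip, interpolate frames to raise the frame rate, then upsample twice. The sketch below is an assumption about the control flow only; all function names, shapes, and scale factors are hypothetical placeholders (the "e.g." shapes follow the settings reported in the paper).

```python
from typing import Callable
import torch

def generate_video(
    prompt_emb: torch.Tensor,
    decoder: Callable,      # embedding -> short, low-res, low-fps clip
    interpolate: Callable,  # inserts frames between generated ones
    sr_low: Callable,       # first super-resolution stage
    sr_high: Callable,      # second super-resolution stage
) -> torch.Tensor:
    frames = decoder(prompt_emb)   # e.g. (16, 3, 64, 64)
    frames = interpolate(frames)   # e.g. (76, 3, 64, 64)
    frames = sr_low(frames)        # e.g. (76, 3, 256, 256)
    frames = sr_high(frames)       # e.g. (76, 3, 768, 768)
    return frames
```

Because the stages are decoupled, the interpolation and super-resolution models can be reused for other tasks (e.g. animating a single image or increasing the frame rate of an existing video), which is what the abstract means by applications beyond T2V.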


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Text-to-Video Generation | MSR-VTT | Make-A-Video | FID | 13.17 | #5 |
| Text-to-Video Generation | MSR-VTT | Make-A-Video | CLIPSIM | 0.3049 | #4 |
| Text-to-Video Generation | MSR-VTT | Make-A-Video | CLIP-FID | 13.17 | #3 |
| Text-to-Video Generation | MSR-VTT | CogVideo (English) | FID | 23.59 | #7 |
| Text-to-Video Generation | MSR-VTT | CogVideo (English) | CLIPSIM | 0.2631 | #13 |
| Text-to-Video Generation | MSR-VTT | CogVideo (English) | CLIP-FID | 23.59 | #4 |
| Video Generation | UCF-101 | Make-A-Video (Zero-shot, 256x256, class-conditional) | Inception Score | 33 | #22 |
| Video Generation | UCF-101 | Make-A-Video (Zero-shot, 256x256, class-conditional) | FVD16 | 367.23 | #20 |
| Video Generation | UCF-101 | Make-A-Video (Finetuning, 256x256, class-conditional) | Inception Score | 82.55 | #3 |
| Video Generation | UCF-101 | Make-A-Video (Finetuning, 256x256, class-conditional) | FVD16 | 81.25 | #4 |
