TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Generation	BAIR Robot Pushing	NUWA	FVD score	86.9	# 3
Video Generation	BAIR Robot Pushing	NUWA	Cond	1	# 1
Video Generation	BAIR Robot Pushing	NUWA	Pred	15	# 8
Video Generation	BAIR Robot Pushing	NUWA	Train	15	# 2
Text-to-Video Generation	Kinetics	NUWA (128×128)	Accuracy	77.9	# 1
Text-to-Image Generation	MS COCO	NÜWA (256 x 256)	FID	12.9	# 37
Text-to-Image Generation	MS COCO	NÜWA (256 x 256)	Inception score	27.2	# 13
Text-to-Image Generation	MS COCO	XMC-GAN (256 x 256)	FID	9.3	# 25
Text-to-Image Generation	MS COCO	XMC-GAN (256 x 256)	Inception score	30.5	# 9
Text-to-Image Generation	MS COCO	CogView (256 x 256)	FID	27.1	# 54
Text-to-Image Generation	MS COCO	CogView (256 x 256)	Inception score	18.2	# 19
Text-to-Image Generation	MS COCO	DALL-E (256 x 256)	FID	27.5	# 56
Text-to-Image Generation	MS COCO	DALL-E (256 x 256)	Inception score	17.9	# 21
Text-to-Image Generation	MS COCO	DF-GAN (256 x 256)	Inception score	18.7	# 18
Text-to-Image Generation	MS COCO	AttnGAN (256 x 256)	FID	35.2	# 63
Text-to-Image Generation	MS COCO	AttnGAN (256 x 256)	Inception score	23.3	# 17
Text-to-Image Generation	MS COCO	DM-GAN (256 x 256)	FID	26.0	# 52
Text-to-Image Generation	MS COCO	DM-GAN (256 x 256)	Inception score	32.2	# 7
Text-to-Video Generation	MSR-VTT	NUWA	FID	47.68	# 9
Text-to-Video Generation	MSR-VTT	NUWA	CLIPSIM	0.2439	# 15
Text-to-Video Generation	MSR-VTT	NUWA	CLIP-FID	47.68	# 6

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/nuwa-visual-synthesis-pre-training-for-neural/text-to-video-generation-on-kinetics)](https://paperswithcode.com/sota/text-to-video-generation-on-kinetics?p=nuwa-visual-synthesis-pre-training-for-neural)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/nuwa-visual-synthesis-pre-training-for-neural/video-generation-on-bair-robot-pushing)](https://paperswithcode.com/sota/video-generation-on-bair-robot-pushing?p=nuwa-visual-synthesis-pre-training-for-neural)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/nuwa-visual-synthesis-pre-training-for-neural/text-to-video-generation-on-msr-vtt)](https://paperswithcode.com/sota/text-to-video-generation-on-msr-vtt?p=nuwa-visual-synthesis-pre-training-for-neural)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/nuwa-visual-synthesis-pre-training-for-neural/text-to-image-generation-on-coco)](https://paperswithcode.com/sota/text-to-image-generation-on-coco?p=nuwa-visual-synthesis-pre-training-for-neural)`

NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

24 Nov 2021 · Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan

This paper presents a unified multimodal pre-trained model called NÜWA that can generate new or manipulate existing visual data (i.e., images and videos) for various visual synthesis tasks. To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby Attention (3DNA) mechanism is also proposed to consider the nature of the visual data and reduce the computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc. Furthermore, it also shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks. Project repo is https://github.com/microsoft/NUWA.
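The abstract does not spell out how 3DNA restricts attention, so the following is only a rough sketch of the general idea of local ("nearby") attention on a 3D token grid, not the authors' implementation. It assumes each token attends only to tokens within a fixed extent along every axis, and it places text, image, and video tokens on one shared grid by giving unused axes size 1. The axis order `(height, width, frames)`, the grid sizes, the extents, and the helper names `nearby_positions` and `attention_pair_counts` are all illustrative assumptions.

```python
from itertools import product


def nearby_positions(pos, grid, extent):
    """Return grid coordinates within `extent` of `pos` on each axis.

    Hypothetical helper: the neighbourhood is clipped at the grid borders.
    """
    ranges = [
        range(max(0, p - e), min(g, p + e + 1))
        for p, g, e in zip(pos, grid, extent)
    ]
    return list(product(*ranges))


def attention_pair_counts(grid, extent):
    """Count query-key pairs for full attention vs. nearby attention."""
    positions = list(product(*(range(g) for g in grid)))
    full = len(positions) ** 2
    nearby = sum(len(nearby_positions(p, grid, extent)) for p in positions)
    return full, nearby


# Assumed shared layout (height, width, frames); sizes are made up.
text_grid = (1, 1, 8)    # 1D data: a sequence of 8 tokens
image_grid = (4, 4, 1)   # 2D data: a 4x4 grid of tokens
video_grid = (4, 4, 4)   # 3D data: 4 frames of 4x4 tokens

full, nearby = attention_pair_counts(video_grid, extent=(1, 1, 1))
print(full, nearby)  # prints: 4096 1000
```

With these toy sizes, full attention scores 64 × 64 = 4096 query-key pairs, while the local variant scores 1000. This illustrates the kind of saving that the abstract's "reduce the computational complexity" refers to, under the assumptions stated above.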

Code

lucidrains/nuwa-pytorch

533

Tasks

Image Generation

Text-to-Image Generation

Text-to-Video Generation

Video Generation

Video Prediction

Datasets

MS COCO

Kinetics

MSR-VTT

VSPW

BAIR Robot Pushing

Results from the Paper

Ranked #1 on Text-to-Video Generation on Kinetics

Methods

1-bit Adam • 1-bit LAMB • 1cycle • 1x1 Convolution • Adam • LAMB
