TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image-to-Image Translation	ADE20K Labels-to-Photos	VQGAN+Transformer	FID	35.5	# 11
Image Generation	CelebA 256x256	VQGAN	FID	10.2	# 4
Image Generation	CelebA-HQ 256x256	VQGAN+Transformer	FID	10.2	# 11
Image-to-Image Translation	COCO-Stuff Labels-to-Photos	VQGAN+Transformer	FID	22.4	# 8
Text-to-Image Generation	Conceptual Captions	VQ-GAN	FID	28.86	# 5
DeepFake Detection	FakeAVCeleb	VQGAN	ROC AUC	51.8	# 9
DeepFake Detection	FakeAVCeleb	VQGAN	AP	55.0	# 9
Image Generation	FFHQ 256 x 256	VQGAN+Transformer	FID	9.6	# 25
Image Generation	ImageNet 256x256	VQGAN+Transformer (k=600, p=1.0, a=0.05)	FID	5.2	# 36
Image Generation	ImageNet 256x256	VQGAN+Transformer (k=mixed, p=1.0, a=0.005)	FID	6.59	# 38
Image Outpainting	LHQC	Taming	Block-FID (Right Extend)	22.53	# 4
Image Outpainting	LHQC	Taming	Block-FID (Left Extend)	-	# 4
Image Outpainting	LHQC	Taming	Block-FID (Down Extend)	26.38	# 4
Image Outpainting	LHQC	Taming	Block-FID (Up Extend)	-	# 4
Text-to-Image Generation	LHQC	Taming	Block-FID	38.89	# 3
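
Most rows above report FID (Fréchet Inception Distance), for which lower values are better. As a reading aid, the standard definition is sketched below. The symbols are not taken from this page: $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are assumed to denote the mean and covariance of Inception features for real and generated images. The Block-FID variants in the LHQC rows are benchmark-specific and are not covered by this formula.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```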

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/taming-transformers-for-high-resolution-image/text-to-image-generation-on-lhqc)](https://paperswithcode.com/sota/text-to-image-generation-on-lhqc?p=taming-transformers-for-high-resolution-image)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/taming-transformers-for-high-resolution-image/image-generation-on-celeba-256x256)](https://paperswithcode.com/sota/image-generation-on-celeba-256x256?p=taming-transformers-for-high-resolution-image)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/taming-transformers-for-high-resolution-image/image-outpainting-on-lhqc)](https://paperswithcode.com/sota/image-outpainting-on-lhqc?p=taming-transformers-for-high-resolution-image)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/taming-transformers-for-high-resolution-image/text-to-image-generation-on-conceptual)](https://paperswithcode.com/sota/text-to-image-generation-on-conceptual?p=taming-transformers-for-high-resolution-image)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/taming-transformers-for-high-resolution-image/image-to-image-translation-on-coco-stuff)](https://paperswithcode.com/sota/image-to-image-translation-on-coco-stuff?p=taming-transformers-for-high-resolution-image)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/taming-transformers-for-high-resolution-image/deepfake-detection-on-fakeavceleb-1)](https://paperswithcode.com/sota/deepfake-detection-on-fakeavceleb-1?p=taming-transformers-for-high-resolution-image)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/taming-transformers-for-high-resolution-image/image-to-image-translation-on-ade20k-labels)](https://paperswithcode.com/sota/image-to-image-translation-on-ade20k-labels?p=taming-transformers-for-high-resolution-image)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/taming-transformers-for-high-resolution-image/image-generation-on-celeba-hq-256x256)](https://paperswithcode.com/sota/image-generation-on-celeba-hq-256x256?p=taming-transformers-for-high-resolution-image)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/taming-transformers-for-high-resolution-image/image-generation-on-ffhq-256-x-256)](https://paperswithcode.com/sota/image-generation-on-ffhq-256-x-256?p=taming-transformers-for-high-resolution-image)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/taming-transformers-for-high-resolution-image/image-generation-on-imagenet-256x256)](https://paperswithcode.com/sota/image-generation-on-imagenet-256x256?p=taming-transformers-for-high-resolution-image)`

Taming Transformers for High-Resolution Image Synthesis

CVPR 2021 · Patrick Esser, Robin Rombach, Björn Ommer

Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at https://github.com/CompVis/taming-transformers .
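
The hand-off between steps (i) and (ii) can be illustrated with a small sketch. In a VQGAN-style model, each feature vector produced by the CNN encoder is replaced by its nearest entry in a learned codebook, and the resulting indices form the token sequence the transformer models. The code below shows only that nearest-neighbour lookup. The function names, the 2-D toy codebook, and the feature values are hypothetical and chosen for illustration; the encoder, decoder, and transformer are not modelled.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Map each feature vector to the index of its nearest codebook entry.

    Returns the index sequence (the discrete "tokens") and the
    corresponding quantized vectors.
    """
    indices: List[int] = []
    for f in features:
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(f, codebook[k]),
        )
        indices.append(nearest)
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Toy stand-ins (assumptions): a 3-entry codebook and 4 encoder outputs.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
features = [(0.1, -0.2), (0.9, 1.2), (0.2, 1.8), (0.6, 0.6)]

indices, quantized = quantize(features, codebook)
print(indices)  # [0, 1, 2, 1]
```

Because the image is reduced to a short sequence of codebook indices rather than raw pixels, the sequence length the transformer has to handle stays manageable even for high-resolution inputs.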

Code

CompVis/taming-transformers (official)	5,353 stars
alibaba/EasyNLP	1,941 stars
dome272/vqgan-pytorch	357 stars
dome272/VQGAN	357 stars
v-iashin/SpecVQGAN	316 stars

12 implementations are listed in total.

Tasks

DeepFake Detection

Image Generation

Image Outpainting

Image-to-Image Translation

Inductive Bias

Text-to-Image Generation

Vocal Bursts Intensity Prediction

Datasets

ImageNet

CelebA

FFHQ

ADE20K

CelebA-HQ

LSUN

DeepFashion

Conceptual Captions

COCO-Stuff

FakeAVCeleb

LHQ

Results from the Paper

Ranked #3 on Text-to-Image Generation on LHQC


Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
