Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation

18 Oct 2022  ·  Ruijun Li, Weihua Li, Yi Yang, Hanyu Wei, Jianhua Jiang, Quan Bai

Recently, diffusion models have been shown in a number of studies to perform remarkably well on text-to-image synthesis, opening up new research opportunities for image generation. Google's Imagen follows this trend and surpasses DALL-E 2 as the leading model for text-to-image generation. However, Imagen relies solely on a T5 language model for text processing, which cannot guarantee that the semantic information of the text is learned. Furthermore, the Efficient UNet used by Imagen is not the best choice for image processing. To address these issues, we propose Swinv2-Imagen, a novel text-to-image diffusion model based on a hierarchical vision Transformer and a scene graph incorporating a semantic layout. In the proposed model, the feature vectors of entities and relationships are extracted and fed into the diffusion model, effectively improving the quality of the generated images. In addition, we introduce a Swin-Transformer-based UNet architecture, called Swinv2-Unet, which can address the problems stemming from CNN convolution operations. Extensive experiments are conducted to evaluate the performance of the proposed model on three real-world datasets, i.e., MSCOCO, CUB and MM-CelebA-HQ. The experimental results show that the proposed Swinv2-Imagen model outperforms several popular state-of-the-art methods.
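The abstract describes conditioning the diffusion model on both T5 text embeddings and scene-graph entity/relationship features. The snippet below is a minimal, hedged sketch (not the authors' code) of one way such a fusion could look: projecting the two embedding sequences into a shared space and concatenating them so they can serve as the cross-attention context of a diffusion UNet. All module names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SemanticLayoutConditioner(nn.Module):
    """Projects T5 token embeddings and scene-graph node/edge embeddings into a
    shared conditioning space and concatenates them along the sequence axis."""

    def __init__(self, t5_dim=1024, graph_dim=256, cond_dim=768):
        super().__init__()
        self.text_proj = nn.Linear(t5_dim, cond_dim)
        self.graph_proj = nn.Linear(graph_dim, cond_dim)

    def forward(self, t5_tokens, graph_nodes):
        # t5_tokens:   (B, L_text, t5_dim)   -- output of a frozen T5 encoder
        # graph_nodes: (B, L_graph, graph_dim) -- entity/relation features from a
        #              scene-graph encoder (assumed; the paper's encoder may differ)
        text = self.text_proj(t5_tokens)
        graph = self.graph_proj(graph_nodes)
        return torch.cat([text, graph], dim=1)  # (B, L_text + L_graph, cond_dim)


if __name__ == "__main__":
    cond = SemanticLayoutConditioner()
    t5 = torch.randn(2, 77, 1024)     # dummy T5 embeddings
    graph = torch.randn(2, 16, 256)   # dummy scene-graph embeddings
    context = cond(t5, graph)
    print(context.shape)              # torch.Size([2, 93, 768])
    # `context` would then act as the key/value sequence in the UNet's
    # cross-attention layers during denoising.
```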

Results (no extra training data used)

Task                      Dataset                  Model          Metric           Value   Global Rank
Text-to-Image Generation  CUB                      Swinv2-Imagen  FID              9.78    #2
Text-to-Image Generation  CUB                      Swinv2-Imagen  Inception score  8.44    #1
Text-to-Image Generation  MS COCO                  Swinv2-Imagen  FID              7.21    #15
Text-to-Image Generation  MS COCO                  Swinv2-Imagen  Inception score  31.46   #8
Text-to-Image Generation  Multi-Modal-CelebA-HQ    Swinv2-Imagen  FID              10.31   #1
