L-Verse: Bidirectional Generation Between Image and Text

Beyond modeling long-range interactions in natural language, transformers are becoming the de facto standard for many vision tasks thanks to their power and scalability. In cross-modal tasks between image and text in particular, vector-quantized variational autoencoders (VQ-VAEs) are widely used to encode a raw RGB image into a sequence of discrete feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of a feature-augmented variational autoencoder (AugVAE) and a bidirectional auto-regressive transformer (BiART) for image-to-text and text-to-image generation. AugVAE achieves state-of-the-art reconstruction performance on the ImageNet1K validation set and is robust to unseen images in the wild. Unlike other models, BiART can distinguish between the image (or text) serving as a conditional reference and the one being generated. L-Verse can be used directly for image-to-text or text-to-image generation without any finetuning or an extra object-detection framework. In quantitative and qualitative experiments, L-Verse compares favorably with previous methods on both image-to-text and text-to-image generation on MS-COCO Captions. We further assess the scalability of the L-Verse architecture on Conceptual Captions and present initial results on bidirectional vision-language representation learning in the general domain.

CVPR 2022
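The abstract describes a two-stage design: a VQ-VAE-style tokenizer (AugVAE) that turns an RGB image into a sequence of discrete codes, and an autoregressive transformer (BiART) that can tell conditioning tokens apart from the tokens it is generating. Below is a minimal PyTorch sketch of that idea; the module names, sizes, and the use of a segment embedding to mark condition versus target are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two-stage pipeline the abstract describes:
# (1) a VQ-VAE-style encoder that maps an RGB image to a grid of discrete
# token indices, and (2) an autoregressive transformer whose segment
# embedding marks each position as "conditional reference" or "generation
# target". All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup (training losses / straight-through
    estimator omitted for brevity)."""
    def __init__(self, num_codes=8192, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                    # z: (B, dim, H, W)
        B, D, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, D)          # (B*H*W, D)
        dist = torch.cdist(flat, self.codebook.weight)       # (B*H*W, K)
        idx = dist.argmin(dim=1)                             # nearest code id
        return idx.view(B, H * W)                            # discrete tokens

class ImageTokenizer(nn.Module):
    """Toy encoder: 256x256 RGB -> 16x16 grid of code indices (256 tokens)."""
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=4), nn.ReLU(),
            nn.Conv2d(64, dim, 4, stride=4),
        )
        self.quantizer = VectorQuantizer(dim=dim)

    def forward(self, img):                                  # img: (B, 3, 256, 256)
        return self.quantizer(self.encoder(img))             # (B, 256)

class BidirectionalAR(nn.Module):
    """Decoder-only transformer; a segment embedding tells the model whether a
    position belongs to the conditioning modality or the one being generated,
    so the same weights can run text->image and image->text."""
    def __init__(self, vocab=16384, dim=512, seq=320):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(seq, dim)
        self.seg = nn.Embedding(2, dim)          # 0 = condition, 1 = target
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, segments):                     # both: (B, S)
        S = tokens.size(1)
        x = (self.tok(tokens) + self.seg(segments)
             + self.pos(torch.arange(S, device=tokens.device)))
        causal = torch.triu(torch.full((S, S), float("-inf"),
                                       device=tokens.device), diagonal=1)
        return self.head(self.blocks(x, mask=causal))        # next-token logits
```

In this sketch, swapping the segment ids (and the order of the two token spans) is what switches the same model between text-to-image and image-to-text generation, which is how the abstract's claim of bidirectional use without finetuning could be realized.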
Task                      Dataset           Model       Metric    Value  Global Rank
Image Captioning          COCO Captions     L-Verse     BLEU-4    39.9   #20
                                                        METEOR    31.4   #7
                                                        ROUGE-L   60.4   #3
                                                        SPICE     23.3   #20
Image Reconstruction      ImageNet 256x256  AugVAE-ML   FID       1.04   #1
Image Reconstruction      ImageNet 256x256  AugVAE-SL   FID       3.28   #2
Text-to-Image Generation  MS COCO           L-Verse-CC  FID       37.2   #64
                                                        FID-1     31.6   #3
                                                        FID-2     25.7   #3
                                                        FID-4     21.4   #3
                                                        FID-8     21.1   #2
Text-to-Image Generation  MS COCO           L-Verse     FID       45.8   #65
                                                        FID-1     41.9   #4
                                                        FID-2     35.5   #4
                                                        FID-4     30.2   #4
                                                        FID-8     29.83  #4
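In the text-to-image rows above, FID-k is commonly understood as FID computed after blurring both the real and generated images with a Gaussian filter of radius k (the blur-radius protocol popularized by DALL-E's MS-COCO evaluation), so it measures fidelity at progressively coarser scales. The sketch below shows only that preprocessing step under this assumption; `compute_fid` is a hypothetical placeholder for any off-the-shelf FID implementation (e.g., the pytorch-fid or clean-fid packages) and is not defined here.

```python
# Hedged sketch of the assumed FID-k preprocessing: blur both image sets with
# a Gaussian filter of radius k before computing standard FID. `compute_fid`
# below is a hypothetical placeholder, not a real API.
from pathlib import Path
from PIL import Image, ImageFilter

def blur_folder(src: str, dst: str, radius: int) -> None:
    """Write Gaussian-blurred copies of every PNG in `src` into `dst`."""
    out = Path(dst)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(src).glob("*.png"):
        img = Image.open(path).convert("RGB")
        img.filter(ImageFilter.GaussianBlur(radius=radius)).save(out / path.name)

# FID-k = FID(blur_k(real), blur_k(generated)); k = 0 would be plain FID.
for k in (1, 2, 4, 8):
    blur_folder("real", f"real_blur{k}", k)
    blur_folder("generated", f"gen_blur{k}", k)
    # fid_k = compute_fid(f"real_blur{k}", f"gen_blur{k}")  # placeholder call
```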
