L-Verse: Bidirectional Generation Between Image and Text

Beyond modeling long-range interactions in natural language, transformers are becoming the de facto standard for many vision tasks thanks to their power and scalability. In cross-modal tasks between image and text in particular, vector-quantized variational autoencoders (VQ-VAEs) are widely used to encode a raw RGB image into a sequence of discrete feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of a feature-augmented variational autoencoder (AugVAE) and a bidirectional auto-regressive transformer (BiART) for image-to-text and text-to-image generation. AugVAE achieves state-of-the-art reconstruction performance on the ImageNet1K validation set and is robust to unseen images in the wild. Unlike other models, BiART can distinguish between the image (or text) serving as a conditional reference and the one being generated. L-Verse can be used directly for image-to-text or text-to-image generation without any finetuning or an extra object-detection framework. In quantitative and qualitative experiments, L-Verse compares favorably with previous methods on both image-to-text and text-to-image generation on MS-COCO Captions. We further assess the scalability of the L-Verse architecture on Conceptual Captions and present initial results on bidirectional vision-language representation learning in the general domain.

CVPR 2022
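The abstract describes a two-stage design: a VQ-VAE-style tokenizer (AugVAE) that turns an RGB image into a sequence of discrete codes, and an autoregressive transformer (BiART) that can tell conditioning tokens apart from the tokens it is generating. Below is a minimal PyTorch sketch of that idea; the module names, sizes, and the use of a segment embedding to mark condition versus target are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two-stage pipeline the abstract describes:
# (1) a VQ-VAE-style encoder that maps an RGB image to a grid of discrete
# token indices, and (2) an autoregressive transformer whose segment
# embedding marks each position as "conditional reference" or "generation
# target". All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup (training losses / straight-through
    estimator omitted for brevity)."""
    def __init__(self, num_codes=8192, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                    # z: (B, dim, H, W)
        B, D, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, D)          # (B*H*W, D)
        dist = torch.cdist(flat, self.codebook.weight)       # (B*H*W, K)
        idx = dist.argmin(dim=1)                             # nearest code id
        return idx.view(B, H * W)                            # discrete tokens

class ImageTokenizer(nn.Module):
    """Toy encoder: 256x256 RGB -> 16x16 grid of code indices (256 tokens)."""
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=4), nn.ReLU(),
            nn.Conv2d(64, dim, 4, stride=4),
        )
        self.quantizer = VectorQuantizer(dim=dim)

    def forward(self, img):                                  # img: (B, 3, 256, 256)
        return self.quantizer(self.encoder(img))             # (B, 256)

class BidirectionalAR(nn.Module):
    """Decoder-only transformer; a segment embedding tells the model whether a
    position belongs to the conditioning modality or the one being generated,
    so the same weights can run text->image and image->text."""
    def __init__(self, vocab=16384, dim=512, seq=320):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(seq, dim)
        self.seg = nn.Embedding(2, dim)          # 0 = condition, 1 = target
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, segments):                     # both: (B, S)
        S = tokens.size(1)
        x = (self.tok(tokens) + self.seg(segments)
             + self.pos(torch.arange(S, device=tokens.device)))
        causal = torch.triu(torch.full((S, S), float("-inf"),
                                       device=tokens.device), diagonal=1)
        return self.head(self.blocks(x, mask=causal))        # next-token logits
```

In this sketch, swapping the segment ids (and the order of the two token spans) is what switches the same model between text-to-image and image-to-text generation, which is how the abstract's claim of bidirectional use without finetuning could be realized.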
Task                      Dataset           Model       Metric    Value  Global Rank
Image Captioning          COCO Captions     L-Verse     BLEU-4    39.9   #20
                                                        METEOR    31.4   #7
                                                        ROUGE-L   60.4   #3
                                                        SPICE     23.3   #20
Image Reconstruction      ImageNet 256x256  AugVAE-ML   FID       1.04   #1
Image Reconstruction      ImageNet 256x256  AugVAE-SL   FID       3.28   #2
Text-to-Image Generation  MS COCO           L-Verse-CC  FID       37.2   #64
                                                        FID-1     31.6   #3
                                                        FID-2     25.7   #3
                                                        FID-4     21.4   #3
                                                        FID-8     21.1   #2
Text-to-Image Generation  MS COCO           L-Verse     FID       45.8   #65
                                                        FID-1     41.9   #4
                                                        FID-2     35.5   #4
                                                        FID-4     30.2   #4
                                                        FID-8     29.83  #4
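In the text-to-image rows above, FID-k is commonly understood as FID computed after blurring both the real and generated images with a Gaussian filter of radius k (the blur-radius protocol popularized by DALL-E's MS-COCO evaluation), so it measures fidelity at progressively coarser scales. The sketch below shows only that preprocessing step under this assumption; `compute_fid` is a hypothetical placeholder for any off-the-shelf FID implementation (e.g., the pytorch-fid or clean-fid packages) and is not defined here.

```python
# Hedged sketch of the assumed FID-k preprocessing: blur both image sets with
# a Gaussian filter of radius k before computing standard FID. `compute_fid`
# below is a hypothetical placeholder, not a real API.
from pathlib import Path
from PIL import Image, ImageFilter

def blur_folder(src: str, dst: str, radius: int) -> None:
    """Write Gaussian-blurred copies of every PNG in `src` into `dst`."""
    out = Path(dst)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(src).glob("*.png"):
        img = Image.open(path).convert("RGB")
        img.filter(ImageFilter.GaussianBlur(radius=radius)).save(out / path.name)

# FID-k = FID(blur_k(real), blur_k(generated)); k = 0 would be plain FID.
for k in (1, 2, 4, 8):
    blur_folder("real", f"real_blur{k}", k)
    blur_folder("generated", f"gen_blur{k}", k)
    # fid_k = compute_fid(f"real_blur{k}", f"gen_blur{k}")  # placeholder call
```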
