SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective. Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score). Furthermore, we demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.
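The abstract names a single prefix language modeling objective without spelling it out. As a rough illustration only, the sketch below builds the attention pattern usually associated with prefix language modeling: positions inside the prefix attend to each other bidirectionally, the remaining positions attend causally, and only the non-prefix tokens contribute to the loss. The function names `prefix_lm_mask` and `loss_positions` and the toy sizes are assumptions made for this example; they are not taken from the paper or its code.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption: the first `prefix_len` positions form the prefix.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Every position sees the whole prefix (bidirectional inside it);
            # positions after the prefix are visible only causally (j <= i).
            row.append(j < prefix_len or j <= i)
        mask.append(row)
    return mask


def loss_positions(seq_len, prefix_len):
    """Indices of tokens scored by the language-modeling loss (suffix only)."""
    return list(range(prefix_len, seq_len))


# Toy example: 5 positions, the first 2 of which are the prefix.
mask = prefix_lm_mask(seq_len=5, prefix_len=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# Prints:
# xx...
# xx...
# xxx..
# xxxx.
# xxxxx
print(loss_positions(seq_len=5, prefix_len=2))  # [2, 3, 4]
```

Setting `prefix_len=0` reduces the mask to an ordinary causal (left-to-right) language-modeling mask, which is one way to see how the two objectives relate.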

ICLR 2022
Task                       Dataset                 Model         Metric               Value   Global Rank
Image Captioning           COCO Captions           SimVLM        BLEU-4               40.6    # 7
                                                                 METEOR               33.4    # 1
                                                                 CIDEr                143.3   # 3
                                                                 SPICE                25.4    # 3
Visual Reasoning           NLVR2 Dev               SimVLM        Accuracy             84.53   # 3
Visual Reasoning           NLVR2 Test              SimVLM        Accuracy             85.15   # 3
Image Captioning           nocaps entire           Single Model  CIDEr                110.31  # 3
                                                                 B1                   83.78   # 2
                                                                 B2                   68.86   # 2
                                                                 B3                   51.06   # 2
                                                                 B4                   32.2    # 2
                                                                 ROUGE-L              59.86   # 2
                                                                 METEOR               30.55   # 2
                                                                 SPICE                14.49   # 3
Image Captioning           nocaps in-domain        Single Model  CIDEr                108.98  # 3
                                                                 B1                   84.64   # 2
                                                                 B2                   70.0    # 2
                                                                 B3                   52.96   # 3
                                                                 B4                   34.66   # 3
                                                                 ROUGE-L              61.01   # 2
                                                                 METEOR               31.97   # 2
                                                                 SPICE                14.6    # 4
Image Captioning           nocaps near-domain      Single Model  CIDEr                110.76  # 3
                                                                 B1                   84.36   # 2
                                                                 B2                   69.83   # 3
                                                                 B3                   52.42   # 2
                                                                 B4                   33.74   # 2
                                                                 ROUGE-L              60.46   # 2
                                                                 METEOR               30.97   # 2
                                                                 SPICE                14.61   # 3
Image Captioning           nocaps out-of-domain    Single Model  CIDEr                109.49  # 3
                                                                 B1                   80.89   # 2
                                                                 B2                   64.21   # 2
                                                                 B3                   44.38   # 2
                                                                 B4                   24.47   # 2
                                                                 ROUGE-L              56.69   # 2
                                                                 METEOR               27.91   # 2
                                                                 SPICE                13.89   # 2
Image Captioning           nocaps-val-in-domain    SimVLM_huge   CIDEr                113.7   # 3
                                                                 SPICE                -       # 7
                                                                 Pre-train (#images)  1.8B    # 1
Image Captioning           nocaps-val-near-domain  SimVLM_huge   CIDEr                110.9   # 3
                                                                 SPICE                -       # 5
                                                                 Pre-train (#images)  1.8B    # 1
Image Captioning           nocaps-val-out-domain   SimVLM_huge   CIDEr                115.2   # 2
                                                                 SPICE                -       # 5
                                                                 Pre-train (#images)  1.8B    # 1
Image Captioning           nocaps-val-overall      SimVLM_huge   CIDEr                112.2   # 3
                                                                 SPICE                -       # 5
                                                                 Pre-train (#images)  1.8B    # 1
Visual Entailment          SNLI-VE test            SimVLM        Accuracy             86.32   # 3
Visual Entailment          SNLI-VE val             SimVLM        Accuracy             86.21   # 3
Visual Question Answering  VQA v2 test-dev         SimVLM        Accuracy             80.03   # 4
Visual Question Answering  VQA v2 test-std         SimVLM        overall              80.34   # 4

Methods