VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce Mixture-of-Modality-Experts (MoME) Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval. Moreover, we propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs. Experimental results show that VLMo achieves state-of-the-art results on various vision-language tasks, including VQA, NLVR2 and image-text retrieval. The code and pretrained models are available at https://aka.ms/vlmo.
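To make the MoME idea concrete, here is a minimal toy sketch of the routing a MoME block performs: a self-attention layer shared across modalities, followed by a feed-forward "expert" selected by the input modality. All names (`mome_block`, `EXPERTS`, the placeholder attention) are illustrative assumptions, not VLMo's actual implementation, which uses learned Transformer layers.

```python
def shared_self_attention(tokens):
    """Stand-in for the self-attention layer shared by all modalities.
    Here it simply mixes each token with the sequence mean; the real
    model uses multi-head attention."""
    mean = [sum(vals) / len(tokens) for vals in zip(*tokens)]
    return [[(x + m) / 2.0 for x, m in zip(tok, mean)] for tok in tokens]

# Toy modality-specific feed-forward experts (real experts are MLPs).
EXPERTS = {
    "vision":          lambda tok: [2.0 * x for x in tok],
    "language":        lambda tok: [x + 1.0 for x in tok],
    "vision-language": lambda tok: [2.0 * x + 1.0 for x in tok],
}

def mome_block(tokens, modality):
    """One MoME-style block: shared attention, then the expert picked
    by the input modality (image-only, text-only, or fused pair)."""
    attended = shared_self_attention(tokens)
    expert = EXPERTS[modality]
    return [expert(tok) for tok in attended]

# The same attention weights serve every modality; only the expert differs,
# which is why one pretrained model can act as a dual encoder (separate
# vision/language passes) or a fusion encoder (vision-language expert).
image_tokens = [[1.0, 2.0], [3.0, 4.0]]
print(mome_block(image_tokens, "vision"))
```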

Results from the Paper

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Text Retrieval | Image-Chat | VLMo | R@1 | 46.8 | #3 |
| Text Retrieval | Image-Chat | VLMo | R@5 | 67.5 | #3 |
| Text Retrieval | Image-Chat | VLMo | Sum(R@1,5) | 114.3 | #3 |
| Visual Reasoning | NLVR2 Dev | VLMo | Accuracy | 85.64 | #6 |
| Visual Reasoning | NLVR2 Test | VLMo | Accuracy | 86.86 | #6 |
| Image Retrieval | PhotoChat | VLMo | R@1 | 11.5 | #2 |
| Image Retrieval | PhotoChat | VLMo | R@5 | 30.0 | #3 |
| Image Retrieval | PhotoChat | VLMo | R@10 | 39.4 | #2 |
| Image Retrieval | PhotoChat | VLMo | Sum(R@1,5,10) | 83.2 | #2 |
| Visual Question Answering (VQA) | VQA v2 test-dev | VLMo | Accuracy | 82.78 | #3 |
| Visual Question Answering (VQA) | VQA v2 test-std | VLMo | Overall | 81.30 | #5 |
| Visual Question Answering (VQA) | VQA v2 test-std | VLMo | Yes/No | 94.68 | #3 |
| Visual Question Answering (VQA) | VQA v2 test-std | VLMo | Number | 67.26 | #3 |
| Visual Question Answering (VQA) | VQA v2 test-std | VLMo | Other | 72.87 | #3 |