Prismer: A Vision-Language Model with An Ensemble of Experts

4 Mar 2023 · Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, Anima Anandkumar

Recent vision-language models have shown impressive multi-modal generation capabilities. However, they typically require training huge models on massive datasets. As a more scalable alternative, we introduce Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of domain experts. Prismer requires training only a small number of components; the majority of its network weights are inherited from readily available, pre-trained domain experts and kept frozen during training. By leveraging experts from a wide range of domains, we show that Prismer can efficiently pool this expert knowledge and adapt it to various vision-language reasoning tasks. In our experiments, we show that Prismer achieves fine-tuned and few-shot learning performance that is competitive with current state-of-the-art models, whilst requiring up to two orders of magnitude less training data. Code is available at https://github.com/NVlabs/prismer.
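
To make the parameter-efficiency idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It models a network as a list of components, most of them frozen experts, and computes the share of parameters that would actually be trained. The `Component` class, the component names, and all parameter counts are hypothetical placeholders chosen for illustration; they are not Prismer's real modules or sizes.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One block of a model (hypothetical abstraction for illustration)."""
    name: str
    num_params: int
    trainable: bool


def trainable_fraction(components):
    """Return the fraction of all parameters that receive gradient updates."""
    total = sum(c.num_params for c in components)
    trained = sum(c.num_params for c in components if c.trainable)
    return trained / total


# Toy sizes only: these numbers are assumptions, not taken from the paper.
model = [
    Component("frozen_expert_a", 400, trainable=False),
    Component("frozen_expert_b", 400, trainable=False),
    Component("trainable_adapter", 200, trainable=True),
]

print(f"{trainable_fraction(model):.0%} of parameters are trained")
```

In a real framework the same split is usually expressed by disabling gradients on the inherited expert weights and passing only the remaining parameters to the optimizer.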


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Image Captioning | COCO Captions | Prismer | BLEU-4 | 40.4 | #18 |
| | | | METEOR | 31.4 | #7 |
| | | | CIDEr | 136.5 | #20 |
| | | | SPICE | 24.4 | #14 |
| Image Captioning | nocaps entire | Prismer | CIDEr | 110.84 | #6 |
| | | | B1 | 84.87 | #6 |
| | | | B2 | 69.99 | #6 |
| | | | B3 | 52.48 | #6 |
| | | | B4 | 33.66 | #6 |
| | | | ROUGE-L | 60.55 | #6 |
| | | | METEOR | 31.13 | #6 |
| | | | SPICE | 14.91 | #5 |
| Image Captioning | nocaps val | Prismer | CIDEr | 107.9 | #1 |
| | | | SPICE | 14.8 | #1 |
| Visual Question Answering (VQA) | VQA v2 test-dev | Prismer | Accuracy | 78.43 | #16 |
| Visual Question Answering (VQA) | VQA v2 test-std | Prismer | overall | 78.49 | #12 |
| | | | yes/no | 93.09 | #4 |
| | | | number | 61.39 | #6 |
| | | | other | 69.70 | #4 |
