ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.
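A co-attentional transformer layer of the kind described above can be sketched roughly as follows: each stream forms queries from its own features but attends over the other stream's keys and values. This is a minimal illustrative sketch, not the paper's exact implementation; the `CoAttentionLayer` name, hidden sizes, head count, and feed-forward widths below are assumptions.

```python
# Sketch of a two-stream co-attention block: visual and textual features
# exchange information via cross-attention, then pass through per-stream
# feed-forward layers. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class CoAttentionLayer(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=768, num_heads=8):
        super().__init__()
        # Visual queries attend over language keys/values, and vice versa.
        self.vis_cross_attn = nn.MultiheadAttention(
            embed_dim=vis_dim, num_heads=num_heads,
            kdim=txt_dim, vdim=txt_dim, batch_first=True)
        self.txt_cross_attn = nn.MultiheadAttention(
            embed_dim=txt_dim, num_heads=num_heads,
            kdim=vis_dim, vdim=vis_dim, batch_first=True)
        self.vis_norm = nn.LayerNorm(vis_dim)
        self.txt_norm = nn.LayerNorm(txt_dim)
        self.vis_ffn = nn.Sequential(
            nn.Linear(vis_dim, 4 * vis_dim), nn.GELU(),
            nn.Linear(4 * vis_dim, vis_dim))
        self.txt_ffn = nn.Sequential(
            nn.Linear(txt_dim, 4 * txt_dim), nn.GELU(),
            nn.Linear(4 * txt_dim, txt_dim))
        self.vis_ffn_norm = nn.LayerNorm(vis_dim)
        self.txt_ffn_norm = nn.LayerNorm(txt_dim)

    def forward(self, vis, txt):
        # Cross-attention: each stream conditions on the other modality.
        vis_attended, _ = self.vis_cross_attn(query=vis, key=txt, value=txt)
        txt_attended, _ = self.txt_cross_attn(query=txt, key=vis, value=vis)
        vis = self.vis_norm(vis + vis_attended)
        txt = self.txt_norm(txt + txt_attended)
        # Per-stream feed-forward with residual connections.
        vis = self.vis_ffn_norm(vis + self.vis_ffn(vis))
        txt = self.txt_ffn_norm(txt + self.txt_ffn(txt))
        return vis, txt


# Example: 36 image-region features and a 20-token caption, batch of 2.
vis = torch.randn(2, 36, 1024)
txt = torch.randn(2, 20, 768)
vis_out, txt_out = CoAttentionLayer()(vis, txt)
print(vis_out.shape, txt_out.shape)  # (2, 36, 1024) (2, 20, 768)
```

In the full model, several such co-attention blocks are interleaved with ordinary within-stream transformer layers, keeping the two streams' hidden sizes independent.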
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Visual Question Answering (VQA) | A-OKVQA | ViLBERT - OK-VQA | MC Accuracy | 34.1 | # 9 |
| Visual Question Answering (VQA) | A-OKVQA | ViLBERT - OK-VQA | DA VQA Score | 9.2 | # 10 |
| Visual Question Answering (VQA) | A-OKVQA | ViLBERT - VQA | MC Accuracy | 42.1 | # 5 |
| Visual Question Answering (VQA) | A-OKVQA | ViLBERT - VQA | DA VQA Score | 12.0 | # 9 |
| Visual Question Answering (VQA) | A-OKVQA | ViLBERT | MC Accuracy | 41.5 | # 7 |
| Visual Question Answering (VQA) | A-OKVQA | ViLBERT | DA VQA Score | 25.9 | # 6 |
| Referring Expression Comprehension | Talk2Car | ViLBERT (Base) | AP50 | 68.9 | # 5 |
| Visual Question Answering (VQA) | VQA v2 test-dev | ViLBERT | Accuracy | 70.55 | # 29 |