VisualBERT: A Simple and Performant Baseline for Vision and Language

9 Aug 2019 · Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang

We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention...
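
The core idea in the abstract, self-attention over text elements and image regions placed in one sequence, can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a single attention head, identity query/key/value projections (no learned weights), and random vectors standing in for word embeddings and region features. The function name `joint_self_attention` and all sizes are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(text_emb, region_emb):
    """Single-head self-attention over text tokens and image regions together.

    Illustrative assumption: queries, keys, and values are the raw embeddings
    (identity projections), so no parameters are learned here.
    """
    seq = text_emb + region_emb  # one joint sequence: text first, then regions
    d = len(seq[0])
    scale = math.sqrt(d)

    # weights[i][j]: how much element i attends to element j
    weights = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in seq]
        weights.append(softmax(scores))

    # each output vector is a weighted average of all elements in the sequence
    outputs = [
        [sum(w * v[c] for w, v in zip(row, seq)) for c in range(d)]
        for row in weights
    ]
    return weights, outputs


random.seed(0)
dim = 4
# toy stand-ins: 3 word embeddings and 2 image-region feature vectors
text_emb = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
region_emb = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]

weights, outputs = joint_self_attention(text_emb, region_emb)

# weights[i][3:] holds the attention that element i places on the two regions
text_to_region = [row[3:] for row in weights[:3]]
```

Because text and regions share one sequence, every text token gets a weight on every region (the `text_to_region` slice) with no separate alignment module. A stack of such layers with learned projections is the kind of structure in which the alignment the abstract describes could emerge.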

Evaluation results from the paper


Task              Dataset  Model       Metric name        Metric value  Global rank
Visual Reasoning  NLVR     VisualBERT  Accuracy (Dev)     67.4%         #2
Visual Reasoning  NLVR     VisualBERT  Accuracy (Test-P)  67.0%         #2
Visual Reasoning  NLVR     VisualBERT  Accuracy (Test-U)  67.3%         #2