LXMERT: Learning Cross-Modality Encoder Representations from Transformers

20 Aug 2019 · Hao Tan, Mohit Bansal

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections...
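The core idea — letting a language stream and a visual stream exchange information through cross-attention — can be sketched in PyTorch. This is a simplified illustration, not the paper's exact architecture: the class name `CrossModalityLayer`, the dimensions, and the single-layer structure are all illustrative assumptions (LXMERT itself stacks separate language, object-relationship, and cross-modality encoders).

```python
import torch
import torch.nn as nn

class CrossModalityLayer(nn.Module):
    """Illustrative cross-modality layer: each stream attends to the
    other, then passes through a feed-forward block (simplified)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        # language queries attend over visual keys/values, and vice versa
        self.lang_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_to_lang = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim))
        self.vis_ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                    nn.Linear(4 * dim, dim))
        self.norm_l = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, lang, vis):
        # cross-attention: each modality queries the other
        l_attn, _ = self.lang_to_vis(lang, vis, vis)
        v_attn, _ = self.vis_to_lang(vis, lang, lang)
        # residual connections, feed-forward, and layer norm
        lang = self.norm_l(lang + l_attn + self.lang_ff(lang + l_attn))
        vis = self.norm_v(vis + v_attn + self.vis_ff(vis + v_attn))
        return lang, vis

# toy inputs: 20 word embeddings and 36 object-region features
lang = torch.randn(1, 20, 64)
vis = torch.randn(1, 36, 64)
lang_out, vis_out = CrossModalityLayer()(lang, vis)
print(lang_out.shape, vis_out.shape)
```

Each stream keeps its own sequence length (words vs. detected object regions) while the cross-attention lets the representations align across modalities.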


Evaluation results from the paper

| Task | Dataset | Model | Metric | Value | Global rank |
| --- | --- | --- | --- | --- | --- |
| Visual Reasoning | NLVR2 | LXMERT | Accuracy (Dev) | 74.9% | #1 |
| Visual Reasoning | NLVR2 | LXMERT | Accuracy (Test-P) | 74.5% | #1 |
| Visual Reasoning | NLVR2 | LXMERT | Accuracy (Test-U) | 76.2% | #1 |
| Visual Question Answering | VizWiz | LXMERT | Accuracy | 55.40% | #1 |
| Visual Question Answering | VQA v2 | LXMERT | Accuracy | 72.54% | #1 |