LXMERT: Learning Cross-Modality Encoder Representations from Transformers

EMNLP-IJCNLP 2019 · Hao Tan, Mohit Bansal

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections...
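In the paper, the cross-modality encoder that learns these connections stacks layers in which each stream first cross-attends to the other, then applies self-attention and a feed-forward block. Below is a minimal PyTorch sketch of one such layer; it is an illustration of the cross-attention idea, not the authors' code, and all dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalityLayer(nn.Module):
    """Sketch of one cross-modality layer: cross-attention between the
    language and vision streams, then per-stream self-attention and FFN."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        # One cross-attention module per direction.
        self.lang_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Per-stream self-attention after the cross-modal exchange.
        self.lang_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Per-stream position-wise feed-forward blocks.
        self.lang_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.vis_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # BERT-style post-residual layer norms, one per sub-layer and stream.
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(6)])

    def forward(self, lang, vis):
        # lang: (batch, n_tokens, dim); vis: (batch, n_objects, dim)
        l = self.norms[0](lang + self.lang_cross(lang, vis, vis)[0])  # words attend to objects
        v = self.norms[1](vis + self.vis_cross(vis, lang, lang)[0])   # objects attend to words
        l = self.norms[2](l + self.lang_self(l, l, l)[0])
        v = self.norms[3](v + self.vis_self(v, v, v)[0])
        l = self.norms[4](l + self.lang_ffn(l))
        v = self.norms[5](v + self.vis_ffn(v))
        return l, v

lang = torch.randn(2, 20, 768)  # e.g. 20 word-piece embeddings
vis = torch.randn(2, 36, 768)   # e.g. 36 detected-object features
lang_out, vis_out = CrossModalityLayer()(lang, vis)
print(lang_out.shape, vis_out.shape)  # (2, 20, 768) (2, 36, 768)
```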

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Visual Question Answering | GQA test-dev | LXMERT (Pre-train + scratch) | Accuracy | 60.0 | #2 |
| Visual Question Answering | GQA test-std | LXMERT | Accuracy | 60.3 | #2 |
| Visual Reasoning | NLVR2 Dev | LXMERT (Pre-train + scratch) | Accuracy | 74.9 | #1 |
| Visual Reasoning | NLVR2 Test | LXMERT | Accuracy | 76.2 | #2 |
| Visual Question Answering | VizWiz | LXMERT | Accuracy | 55.4 | #1 |
| Visual Question Answering | VQA v2 test-dev | LXMERT (Pre-train + scratch) | Accuracy | 69.9 | #8 |
| Visual Question Answering | VQA v2 test-std | LXMERT | Accuracy | 72.5 | #3 |
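The rows above report fine-tuned LXMERT checkpoints. As a hedged sketch of how such a fine-tuned model is queried, the snippet below uses the Hugging Face Transformers port rather than the paper's original codebase; the checkpoint names are community-hosted assumptions, and the random tensors stand in for the Faster R-CNN region features (2048-d vectors plus normalized boxes) that LXMERT expects as visual input.

```python
import torch
from transformers import LxmertTokenizer, LxmertForQuestionAnswering

# Assumed Hub checkpoints: base tokenizer + a VQA-fine-tuned model.
tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertForQuestionAnswering.from_pretrained("unc-nlp/lxmert-vqa-uncased")

inputs = tokenizer("What color is the cat?", return_tensors="pt")
visual_feats = torch.randn(1, 36, 2048)  # placeholder for detector region features
visual_pos = torch.rand(1, 36, 4)        # placeholder normalized boxes (x1, y1, x2, y2)

with torch.no_grad():
    out = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)

# Index of the highest-scoring answer in the model's answer vocabulary.
answer_id = out.question_answering_score.argmax(-1).item()
print(answer_id)
```

With real detector features in place of the placeholders, the argmax over `question_answering_score` is what the VQA accuracy in the table is computed from.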