LXMERT: Learning Cross-Modality Encoder Representations from Transformers

IJCNLP 2019 · Hao Tan, Mohit Bansal

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help in learning both intra-modality and cross-modality relationships. After fine-tuning from our pre-trained parameters, our model achieves the state-of-the-art results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our pre-trained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR2, and improve the previous best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that both our novel model components and pre-training strategies significantly contribute to our strong results; and also present several attention visualizations for the different encoders. Code and pre-trained models publicly available at: https://github.com/airsplay/lxmert
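To make the idea of a cross-modality encoder more concrete, the following is a minimal sketch of bidirectional cross-attention, in which word features attend over object features and vice versa. It is an illustration under stated assumptions, not the authors' implementation: the function names `softmax` and `cross_attention` and the toy feature values are hypothetical, and the sketch deliberately omits the learned projections, multiple heads, residual connections, layer normalization, and feed-forward sublayers that a real Transformer layer would include.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, contexts):
    """Scaled dot-product attention: each query vector attends over all
    context vectors and returns a weighted average of them.
    Hypothetical helper; no learned weights are involved."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ci for qi, ci in zip(q, c)) / math.sqrt(dim)
            for c in contexts
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * c[j] for w, c in zip(weights, contexts)) for j in range(dim)]
        )
    return outputs

# Toy inputs (assumed values): 2 word vectors and 3 object vectors.
words = [[1.0, 0.0], [0.0, 1.0]]
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

lang_out = cross_attention(words, objects)  # language attends to vision
vis_out = cross_attention(objects, words)   # vision attends to language
```

Each row of `lang_out` is a weighted average of the object vectors, and each row of `vis_out` is a weighted average of the word vectors, so information flows in both directions between the two modalities.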

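| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Visual Question Answering (VQA) | A-OKVQA | LXMERT | MC Accuracy | 41.6 | #8 |
| Visual Question Answering (VQA) | A-OKVQA | LXMERT | DA VQA Score | 25.9 | #9 |
| Visual Question Answering (VQA) | GQA Test2019 | LXR955, Ensemble | Accuracy | 62.71 | #15 |
| Visual Question Answering (VQA) | GQA Test2019 | LXR955, Ensemble | Binary | 79.79 | #13 |
| Visual Question Answering (VQA) | GQA Test2019 | LXR955, Ensemble | Open | 47.64 | #15 |
| Visual Question Answering (VQA) | GQA Test2019 | LXR955, Ensemble | Consistency | 93.1 | #10 |
| Visual Question Answering (VQA) | GQA Test2019 | LXR955, Ensemble | Plausibility | 85.21 | #15 |
| Visual Question Answering (VQA) | GQA Test2019 | LXR955, Ensemble | Validity | 96.36 | #39 |
| Visual Question Answering (VQA) | GQA Test2019 | LXR955, Ensemble | Distribution | 6.42 | #39 |
| Visual Question Answering (VQA) | GQA Test2019 | LXR955, Single Model | Accuracy | 60.33 | #36 |
| Visual Question Answering (VQA) | GQA Test2019 | LXR955, Single Model | Binary | 77.16 | #41 |
| Visual Question Answering (VQA) | GQA Test2019 | LXR955, Single Model | Open | 45.47 | #35 |
| Visual Question Answering (VQA) | GQA Test2019 | LXR955, Single Model | Consistency | 89.59 | #42 |
| Visual Question Answering (VQA) | GQA Test2019 | LXR955, Single Model | Plausibility | 84.53 | #74 |
| Visual Question Answering (VQA) | GQA Test2019 | LXR955, Single Model | Validity | 96.35 | #47 |
| Visual Question Answering (VQA) | GQA Test2019 | LXR955, Single Model | Distribution | 5.69 | #74 |
| Visual Question Answering (VQA) | GQA test-dev | LXMERT (Pre-train + scratch) | Accuracy | 60.0 | #5 |
| Visual Question Answering (VQA) | GQA test-std | LXMERT | Accuracy | 60.3 | #4 |
| Visual Reasoning | NLVR2 Dev | LXMERT (Pre-train + scratch) | Accuracy | 74.9 | #13 |
| Visual Reasoning | NLVR2 Test | LXMERT | Accuracy | 76.2 | #12 |
| Visual Question Answering (VQA) | VizWiz 2018 | LXR955, No Ensemble | overall | 55.4 | #1 |
| Visual Question Answering (VQA) | VizWiz 2018 | LXR955, No Ensemble | yes/no | 74.0 | #1 |
| Visual Question Answering (VQA) | VizWiz 2018 | LXR955, No Ensemble | number | 24.76 | #3 |
| Visual Question Answering (VQA) | VizWiz 2018 | LXR955, No Ensemble | other | 39.0 | #1 |
| Visual Question Answering (VQA) | VizWiz 2018 | LXR955, No Ensemble | unanswerable | 82.26 | #5 |
| Visual Question Answering (VQA) | VQA v2 test-dev | LXMERT (Pre-train + scratch) | Accuracy | 69.9 | #31 |
| Visual Question Answering (VQA) | VQA v2 test-std | LXMERT | overall | 72.5 | #22 |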

Methods