LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections...
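The cross-modality alignment the abstract describes is typically learned with cross-attention, where representations from one modality query features from the other. A minimal single-head NumPy sketch of language-to-vision cross-attention (illustrative only; shapes and names are assumptions, not the paper's implementation):

```python
import numpy as np

def cross_attention(lang, vis):
    """Each language vector attends over all visual feature vectors.

    lang: (n_words, d) word representations
    vis:  (n_objects, d) detected-object features
    Returns a language-conditioned visual context, shape (n_words, d).
    """
    d_k = lang.shape[-1]
    scores = lang @ vis.T / np.sqrt(d_k)          # (n_words, n_objects)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over objects
    return weights @ vis

lang = np.random.rand(5, 32)   # 5 word embeddings (hypothetical sizes)
vis = np.random.rand(36, 32)   # 36 object features, as in common detectors
ctx = cross_attention(lang, vis)
print(ctx.shape)  # (5, 32)
```

Each output row is a convex combination of the visual features, weighted by how strongly that word attends to each object.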

IJCNLP 2019
Results (task, dataset, model; metric value and global rank)

Visual Question Answering on GQA Test2019 (LXR955, Ensemble)
  Accuracy:     62.71  (#12)
  Binary:       79.79  (#9)
  Open:         47.64  (#12)
  Consistency:  93.1   (#7)
  Plausibility: 85.21  (#14)
  Validity:     96.36  (#30)
  Distribution: 6.42   (#36)

Visual Question Answering on GQA Test2019 (LXR955, Single Model)
  Accuracy:     60.33  (#28)
  Binary:       77.16  (#33)
  Open:         45.47  (#29)
  Consistency:  89.59  (#34)
  Plausibility: 84.53  (#67)
  Validity:     96.35  (#38)
  Distribution: 5.69   (#67)

Visual Question Answering on GQA test-dev (LXMERT, Pre-train + scratch)
  Accuracy:     60.0   (#2)

Visual Question Answering on GQA test-std (LXMERT)
  Accuracy:     60.3   (#3)

Visual Reasoning on NLVR2 Dev (LXMERT, Pre-train + scratch)
  Accuracy:     74.9   (#1)

Visual Reasoning on NLVR2 Test (LXMERT)
  Accuracy:     76.2   (#2)

Visual Question Answering on VizWiz 2018 (LXR955, No Ensemble)
  Overall:      55.4   (#1)
  Yes/No:       74.0   (#1)
  Number:       24.76  (#3)
  Other:        39.0   (#1)
  Unanswerable: 82.26  (#5)

Visual Question Answering on VQA v2 test-dev (LXMERT, Pre-train + scratch)
  Accuracy:     69.9   (#10)

Visual Question Answering on VQA v2 test-std (LXMERT)
  Overall:      72.5   (#28)

Methods used in the Paper

  METHOD                         TYPE
  Residual Connection            Skip Connections
  BPE                            Subword Segmentation
  Dense Connections              Feedforward Networks
  Label Smoothing                Regularization
  ReLU                           Activation Functions
  Adam                           Stochastic Optimization
  Softmax                        Output Functions
  Dropout                        Regularization
  Multi-Head Attention           Attention Modules
  Layer Normalization            Normalization
  Scaled Dot-Product Attention   Attention Mechanisms
  Transformer                    Transformers
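Several of the listed methods compose the standard Transformer "Add & Norm" sub-layer: a residual connection wraps each sub-layer and the sum is layer-normalized. A minimal NumPy sketch of that composition (an illustrative sketch, not LXMERT's code):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_sublayer(x, sublayer):
    """Add & Norm: LayerNorm(x + Sublayer(x)), the Transformer pattern."""
    return layer_norm(x + sublayer(x))

x = np.random.rand(4, 16)                    # 4 token vectors, dim 16
w = np.random.rand(16, 16)                   # a stand-in for any sub-layer
y = residual_sublayer(x, lambda h: h @ w)
print(y.shape)  # (4, 16)
```

The residual path lets gradients flow past the sub-layer, while the normalization keeps each token representation on a stable scale.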