VL-BERT: Pre-training of Generic Visual-Linguistic Representations

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input...

Published at ICLR 2020.
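
The abstract's core idea, a single Transformer run over one joint sequence of word embeddings and image-region features, can be sketched in a few lines. The following is a minimal, illustrative PyTorch sketch, not the authors' implementation: the class and argument names (`VisualLinguisticEmbedding`, `visual_dim`, the 2048-d RoI features) are assumptions for illustration, and the real model adds details (pre-training objectives, the treatment of visual positions, a visual feature added to text tokens) omitted here.

```python
# Minimal sketch (NOT the official VL-BERT code): word embeddings and
# per-region visual features are merged into one sequence and fed to a
# shared Transformer encoder. All names here are illustrative.
import torch
import torch.nn as nn

class VisualLinguisticEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, visual_dim=2048,
                 max_positions=512, n_segments=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)   # project RoI features
        self.segment_emb = nn.Embedding(n_segments, hidden)
        self.position_emb = nn.Embedding(max_positions, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, roi_feats, segment_ids):
        # token_ids: (B, T_text); roi_feats: (B, T_img, visual_dim)
        text = self.word_emb(token_ids)
        image = self.visual_proj(roi_feats)
        x = torch.cat([text, image], dim=1)                # one joint sequence
        pos = torch.arange(x.size(1), device=x.device)
        x = x + self.segment_emb(segment_ids) + self.position_emb(pos)
        return self.norm(x)

embed = VisualLinguisticEmbedding()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072,
                               activation="gelu", batch_first=True),
    num_layers=12,
)

tokens = torch.randint(0, 30522, (1, 16))                  # toy sentence
rois = torch.randn(1, 4, 2048)                             # 4 image regions
segments = torch.cat([torch.zeros(1, 16, dtype=torch.long),
                      torch.ones(1, 4, dtype=torch.long)], dim=1)
out = encoder(embed(tokens, rois, segments))               # (1, 20, 768)
```

Because text tokens and image regions share one attention stack, every layer can attend across modalities; this is what distinguishes the design from two-stream models that only fuse late.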
Benchmark Results

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Visual Question Answering | VCR (Q-A) dev | VL-BERT (BASE) | Accuracy | 73.8 | #2 |
| Visual Question Answering | VCR (Q-A) dev | VL-BERT (LARGE) | Accuracy | 75.5 | #1 |
| Visual Question Answering | VCR (Q-AR) dev | VL-BERT (LARGE) | Accuracy | 58.9 | #1 |
| Visual Question Answering | VCR (Q-AR) dev | VL-BERT (BASE) | Accuracy | 55.2 | #2 |
| Visual Question Answering | VCR (QA-R) dev | VL-BERT (LARGE) | Accuracy | 77.9 | #1 |
| Visual Question Answering | VCR (QA-R) dev | VL-BERT (BASE) | Accuracy | 74.4 | #2 |
| Visual Question Answering | VCR (Q-AR) test | VL-BERT (LARGE) | Accuracy | 59.7 | #3 |
| Visual Question Answering | VCR (QA-R) test | VL-BERT (LARGE) | Accuracy | 78.4 | #4 |
| Visual Question Answering | VCR (Q-A) test | VL-BERT (LARGE) | Accuracy | 75.8 | #4 |
| Visual Question Answering | VQA v2 test-dev | VL-BERT (BASE) | Accuracy | 71.16 | #6 |
| Visual Question Answering | VQA v2 test-dev | VL-BERT (LARGE) | Accuracy | 71.79 | #4 |
| Visual Question Answering | VQA v2 test-std | VL-BERT (LARGE) | Overall | 72.2 | #30 |

Methods used in the Paper

| Method | Type |
|--------|------|
| Residual Connection | Skip Connections |
| Attention Dropout | Regularization |
| Linear Warmup With Linear Decay | Learning Rate Schedules |
| Weight Decay | Regularization |
| BPE | Subword Segmentation |
| GELU | Activation Functions |
| Dense Connections | Feedforward Networks |
| Label Smoothing | Regularization |
| ReLU | Activation Functions |
| Adam | Stochastic Optimization |
| WordPiece | Subword Segmentation |
| Softmax | Output Functions |
| Dropout | Regularization |
| Multi-Head Attention | Attention Modules |
| Layer Normalization | Normalization |
| Scaled Dot-Product Attention | Attention Mechanisms |
| Transformer | Transformers |
| BERT | Language Models |
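
Two of the listed methods are worth making concrete. First, scaled dot-product attention, the mechanism at the core of the Transformer backbone; this is the standard formulation softmax(QK^T / sqrt(d_k))V from the Transformer literature, not code from the VL-BERT paper:

```python
# Standard scaled dot-product attention (illustrative, not the paper's code).
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (..., seq_len, d_k); mask broadcastable to (..., seq_q, seq_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # similarity logits
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)            # attention distribution
    return weights @ v                                 # weighted sum of values

q, k, v = (torch.randn(2, 5, 64) for _ in range(3))
out = scaled_dot_product_attention(q, k, v)            # (2, 5, 64)
```

Second, "Linear Warmup With Linear Decay": the learning rate rises linearly for a warmup fraction of training, then falls linearly to zero. This is the usual BERT-style schedule; the `warmup` fraction below is an illustrative default, not a value taken from the paper:

```python
def warmup_linear(step, total, warmup=0.1):
    """Multiplier applied to the base learning rate at a given step."""
    frac = step / total
    if frac < warmup:
        return frac / warmup                           # linear warmup
    return max(0.0, (1.0 - frac) / (1.0 - warmup))     # linear decay to zero
```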