Large-Scale Adversarial Training for Vision-and-Language Representation Learning

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.
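The core idea — perturb in the embedding space rather than on raw pixels/tokens, maximize the task loss under a norm bound, and add a KL term that pushes clean and perturbed predictions to agree — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear classifier, the PGD-style inner loop, and all hyperparameters (`eps`, `steps`, `lr`) are illustrative stand-ins.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def adv_perturb(x, y, W, eps=0.1, steps=3, lr=0.05):
    """PGD-style perturbation of an embedding x (sketch, not the paper's
    exact 'free' training; hyperparameters are illustrative)."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        p = softmax(W @ (x + delta))
        grad = W.T @ (p - y)               # d(cross-entropy)/d(embedding) for a linear head
        delta += lr * np.sign(grad)        # ascend the loss
        delta = np.clip(delta, -eps, eps)  # project back into the L-inf ball
    return delta

def villa_style_loss(x, y, W, delta):
    """Clean CE + adversarial CE + KL(clean || adv), mirroring the
    KL-based invariance regularizer described in the abstract."""
    p_clean = softmax(W @ x)
    p_adv = softmax(W @ (x + delta))
    ce_clean = -np.sum(y * np.log(p_clean))
    ce_adv = -np.sum(y * np.log(p_adv))
    kl = np.sum(p_clean * (np.log(p_clean) - np.log(p_adv)))
    return ce_clean + ce_adv + kl
```

In the paper this perturbation is applied to the image-region and word embeddings of a V+L transformer during both pre-training and finetuning; here a single linear head stands in for the full model to keep the mechanics visible.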

NeurIPS 2020
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ | VILLA-large | Val | 76.17 | #7 |
| Referring Expression Comprehension | RefCOCO+ | VILLA-large | Test A | 81.54 | #8 |
| Referring Expression Comprehension | RefCOCO+ | VILLA-large | Test B | 66.84 | #8 |
| Referring Expression Comprehension | RefCOCO | VILLA-large | Val | 82.39 | #10 |
| Referring Expression Comprehension | RefCOCO | VILLA-large | Test A | 87.48 | #10 |
| Referring Expression Comprehension | RefCOCO | VILLA-large | Test B | 74.84 | #12 |
| Referring Expression Comprehension | RefCOCOg-test | VILLA-large | Accuracy | 76.71 | #7 |
| Referring Expression Comprehension | RefCOCOg-val | VILLA-large | Accuracy | 76.18 | #7 |
| Visual Entailment | SNLI-VE val | VILLA-large | Accuracy | 80.18 | #7 |

