We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).
Ranked #1 on Visual Question Answering on VCR (QA-R) dev
Unlike previous work, which applies joint random masking to both modalities, we use conditional masking on the pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of the image/text; see the sketch after this entry).
Ranked #1 on Visual Reasoning on NLVR2 Test
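Conditional masking, as described above, corrupts only one modality at a time. The following is a minimal sketch of that idea, assuming simple token/feature lists and a placeholder mask symbol; all names are illustrative and not taken from the paper's code.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (illustrative)

def conditional_mask(text_tokens, region_feats, p=0.15, mask_text=True):
    """Corrupt only ONE modality per step, so the model reconstructs it
    conditioned on a full view of the other modality. Joint random
    masking, by contrast, would corrupt text and regions simultaneously."""
    if mask_text:
        # Masked language modeling: mask text, keep every region intact.
        text_tokens = [MASK if random.random() < p else t for t in text_tokens]
    else:
        # Masked region modeling: zero region features, keep the full text.
        region_feats = [0.0 if random.random() < p else f for f in region_feats]
    return text_tokens, region_feats
```

A training loop would alternate between mask_text=True and mask_text=False across steps, rather than masking both modalities in the same step.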
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.
Ranked #3 on Referring Expression Segmentation on RefCOCO+ testA
We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.
Ranked #1 on Visual Question Answering on CLEVR-Humans
In addition, we address a key challenge in this multi-task setup, i.e., prediction conflict, with two innovative designs, namely Consistency Energy Maximization (CEM) and Adaptive Soft Non-Located Suppression (ASNLS).
We propose a simple, fast, and accurate one-stage approach to visual grounding.
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning (a generic sketch of the adversarial-training recipe follows this entry).
Ranked #2 on Referring Expression Comprehension on RefCOCO
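Adversarial training is best read here as the standard recipe: perturb the inputs (or their embeddings) in the direction that increases the loss, then train the model on the perturbed version. The sketch below shows a generic single-step, FGSM-style variant on embeddings under that assumption; it is not VILLA's actual algorithm, and all names are illustrative.

```python
import torch

def adversarial_step(model, embeds, labels, loss_fn, eps=1e-3):
    """Generic FGSM-style adversarial step on input embeddings:
    find a small perturbation that increases the loss, then compute
    the training loss on the perturbed embeddings."""
    embeds = embeds.detach().requires_grad_(True)
    loss = loss_fn(model(embeds), labels)
    grad, = torch.autograd.grad(loss, embeds)  # direction of steepest loss increase
    adv_embeds = embeds + eps * grad.sign()    # perturb within an eps-ball
    return loss_fn(model(adv_embeds.detach()), labels)  # backprop this to update the model
```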
On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches performance comparable to recent task-specific state-of-the-art vision-and-language models.
The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions.
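The three roles above compose into a simple joint loop: the speaker samples an expression for a target region, the listener tries to localize from it, and the reinforcer converts the listener's success into a reward. The toy skeleton below illustrates that interaction; every class and function here is a hypothetical placeholder, not the paper's implementation.

```python
class Speaker:
    """Toy speaker: emits an (expression, log_prob) pair for a region."""
    def sample(self, image, region):
        expression = f"the object at {region}"  # placeholder generation
        return expression, -1.0                 # fake log-probability

class Listener:
    """Toy listener: maps an expression back to a region."""
    def comprehend(self, image, expression):
        return expression.rsplit(" ", 1)[-1]    # recover the region tag

def reinforcer_reward(listener, image, target, expression):
    # Reward 1 if the listener localizes the intended region, else 0:
    # this pushes the speaker toward more discriminative expressions.
    return 1.0 if listener.comprehend(image, expression) == target else 0.0

def joint_step(speaker, listener, image, target):
    expression, log_prob = speaker.sample(image, target)
    reward = reinforcer_reward(listener, image, target, expression)
    return -reward * log_prob                   # REINFORCE-style surrogate loss
```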
While prior work usually treats each sentence separately, attending it to an object on its own, we focus on learning a referring expression comprehension model that exploits the property that synonymous sentences refer to the same object.