We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations.
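A minimal sketch of these three pooling styles, assuming visual and textual features that have already been projected to a shared dimension (names and shapes are illustrative, not from any specific paper):

```python
import torch

def fuse(v, t, mode="concat"):
    """Fuse a visual feature v and a textual feature t, both of shape [B, D]."""
    if mode == "product":   # element-wise product
        return v * t
    if mode == "sum":       # element-wise sum
        return v + t
    if mode == "concat":    # concatenation along the feature dimension
        return torch.cat([v, t], dim=-1)
    raise ValueError(f"unknown fusion mode: {mode}")

v = torch.randn(8, 512)            # e.g. pooled image features
t = torch.randn(8, 512)            # e.g. sentence embedding
fused = fuse(v, t, "product")      # shape [8, 512]; "concat" would give [8, 1024]
```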
We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.
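A rough sketch of the attend-then-reconstruct idea, under assumed dimensions and a simplified reconstruction target (the actual paper decodes the phrase; here the attended region feature is mapped back to the phrase embedding instead):

```python
import torch
import torch.nn as nn

class AttendAndReconstruct(nn.Module):
    """Attend over region proposals with the phrase encoding, then try to
    reconstruct the phrase representation from the attended visual feature.
    Names, dimensions, and the reconstruction head are illustrative assumptions."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)      # compatibility of (phrase, region) pairs
        self.reconstruct = nn.Linear(dim, dim)  # map attended region back to phrase space

    def forward(self, phrase, regions):
        # phrase: [B, D] encoded phrase; regions: [B, R, D] region-proposal features
        q = phrase.unsqueeze(1).expand(-1, regions.size(1), -1)
        attn = torch.softmax(self.score(torch.cat([q, regions], -1)).squeeze(-1), dim=-1)
        attended = (attn.unsqueeze(-1) * regions).sum(dim=1)   # [B, D]
        # The reconstruction loss supervises the latent attention when
        # no ground-truth boxes are available.
        recon_loss = nn.functional.mse_loss(self.reconstruct(attended), phrase)
        return attn, recon_loss   # attn acts as the (latent) grounding
```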
Specifically, the REFER module learns latent relationships between a given question and the dialog history by employing a self-attention mechanism.
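A minimal sketch of this kind of attention between question and history tokens, assuming token embeddings of a shared dimension (this uses a stock PyTorch attention layer, not the REFER implementation):

```python
import torch
import torch.nn as nn

# Question tokens attend over dialog-history tokens to surface relevant context.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

question = torch.randn(4, 12, 512)   # [batch, question_len, dim]
history  = torch.randn(4, 80, 512)   # [batch, history_len, dim]

refined, weights = attn(query=question, key=history, value=history)
# refined: question representation enriched with history context, [4, 12, 512]
# weights: attention over history tokens for each question token, [4, 12, 80]
```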
In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image.
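A rough sketch in the spirit of such a transformer grounding head, assuming pre-extracted visual and language tokens and a learnable regression token from which a box is predicted (layer counts, dimensions, and the box parameterization are assumptions, not TransVG's exact configuration):

```python
import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    """Fuse visual and language tokens with a learnable [REG] token in a
    transformer encoder, then regress a normalized box from that token."""
    def __init__(self, dim=256, layers=6):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: [B, Nv, D], text_tokens: [B, Nt, D]
        reg = self.reg_token.expand(visual_tokens.size(0), -1, -1)
        x = torch.cat([reg, visual_tokens, text_tokens], dim=1)
        x = self.encoder(x)
        return self.box_head(x[:, 0]).sigmoid()   # normalized (cx, cy, w, h)
```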