We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations.
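A minimal sketch of these three pooling styles, assuming visual and textual features that have already been projected to a shared dimension (names and shapes are illustrative, not from any specific paper):

```python
import torch

def fuse(v, t, mode="concat"):
    """Fuse a visual feature v and a textual feature t, both of shape [B, D]."""
    if mode == "product":   # element-wise product
        return v * t
    if mode == "sum":       # element-wise sum
        return v + t
    if mode == "concat":    # concatenation along the feature dimension
        return torch.cat([v, t], dim=-1)
    raise ValueError(f"unknown fusion mode: {mode}")

v = torch.randn(8, 512)            # e.g. pooled image features
t = torch.randn(8, 512)            # e.g. sentence embedding
fused = fuse(v, t, "product")      # shape [8, 512]; "concat" would give [8, 1024]
```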
We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.
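A rough sketch of the attend-then-reconstruct idea, under assumed dimensions and a simplified reconstruction target (the actual paper decodes the phrase; here the attended region feature is mapped back to the phrase embedding instead):

```python
import torch
import torch.nn as nn

class AttendAndReconstruct(nn.Module):
    """Attend over region proposals with the phrase encoding, then try to
    reconstruct the phrase representation from the attended visual feature.
    Names, dimensions, and the reconstruction head are illustrative assumptions."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)      # compatibility of (phrase, region) pairs
        self.reconstruct = nn.Linear(dim, dim)  # map attended region back to phrase space

    def forward(self, phrase, regions):
        # phrase: [B, D] encoded phrase; regions: [B, R, D] region-proposal features
        q = phrase.unsqueeze(1).expand(-1, regions.size(1), -1)
        attn = torch.softmax(self.score(torch.cat([q, regions], -1)).squeeze(-1), dim=-1)
        attended = (attn.unsqueeze(-1) * regions).sum(dim=1)   # [B, D]
        # The reconstruction loss supervises the latent attention when
        # no ground-truth boxes are available.
        recon_loss = nn.functional.mse_loss(self.reconstruct(attended), phrase)
        return attn, recon_loss   # attn acts as the (latent) grounding
```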
Specifically, the REFER module learns latent relationships between a given question and the dialog history by employing a self-attention mechanism.
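A minimal sketch of this kind of attention between question and history tokens, assuming token embeddings of a shared dimension (this uses a stock PyTorch attention layer, not the REFER implementation):

```python
import torch
import torch.nn as nn

# Question tokens attend over dialog-history tokens to surface relevant context.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

question = torch.randn(4, 12, 512)   # [batch, question_len, dim]
history  = torch.randn(4, 80, 512)   # [batch, history_len, dim]

refined, weights = attn(query=question, key=history, value=history)
# refined: question representation enriched with history context, [4, 12, 512]
# weights: attention over history tokens for each question token, [4, 12, 80]
```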
In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image.
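A rough sketch in the spirit of such a transformer grounding head, assuming pre-extracted visual and language tokens and a learnable regression token from which a box is predicted (layer counts, dimensions, and the box parameterization are assumptions, not TransVG's exact configuration):

```python
import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    """Fuse visual and language tokens with a learnable [REG] token in a
    transformer encoder, then regress a normalized box from that token."""
    def __init__(self, dim=256, layers=6):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: [B, Nv, D], text_tokens: [B, Nt, D]
        reg = self.reg_token.expand(visual_tokens.size(0), -1, -1)
        x = torch.cat([reg, visual_tokens, text_tokens], dim=1)
        x = self.encoder(x)
        return self.box_head(x[:, 0]).sigmoid()   # normalized (cx, cy, w, h)
```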