Visual Grounding
117 papers with code • 3 benchmarks • 5 datasets
Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a minimal pipeline sketch follows the list below):
- What is the main focus of the query?
- How should the image and its regions be understood?
- How should the referred object be localized?
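A common way these challenges are handled is a two-stage "propose-and-rank" pipeline: encode the query, encode a set of candidate regions, and score each region against the query. The sketch below is only a minimal illustration of that interface, not any particular paper's method; the module names, feature dimensions, and the use of precomputed region proposals are assumptions for the example.

```python
import torch
import torch.nn as nn

class PhraseEncoder(nn.Module):
    """Encodes a tokenized query into a single vector (challenge 1)."""
    def __init__(self, vocab_size=10000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, token_ids):               # (B, T)
        _, h = self.gru(self.embed(token_ids))  # h: (1, B, dim)
        return h.squeeze(0)                     # (B, dim)

class RegionScorer(nn.Module):
    """Scores each candidate region against the query (challenges 2 and 3)."""
    def __init__(self, region_dim=2048, dim=256):
        super().__init__()
        self.proj = nn.Linear(region_dim, dim)

    def forward(self, region_feats, query_vec):         # (B, R, 2048), (B, dim)
        regions = self.proj(region_feats)               # (B, R, dim)
        return (regions * query_vec.unsqueeze(1)).sum(-1)  # (B, R) similarity

# Toy inference: pick the proposal box that best matches the query.
phrase_enc, scorer = PhraseEncoder(), RegionScorer()
tokens = torch.randint(0, 10000, (1, 6))        # tokenized query, padded
region_feats = torch.randn(1, 36, 2048)         # e.g. 36 detector proposals
boxes = torch.rand(1, 36, 4)                    # (x1, y1, x2, y2) per proposal
scores = scorer(region_feats, phrase_enc(tokens))
best = scores.argmax(dim=1)                     # index of the grounded region
print(boxes[0, best[0]])
```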
Most implemented papers
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
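ViLBERT processes image regions and text in two separate transformer streams that exchange information through co-attentional layers, where each stream's queries attend to the other stream's keys and values. The snippet below is a rough sketch of that exchange using standard PyTorch attention; the dimensions and single-layer setup are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """One co-attentional exchange: each stream attends to the other one."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):                # (B, T, dim), (B, R, dim)
        txt_out, _ = self.txt_attends_img(query=txt, key=img, value=img)
        img_out, _ = self.img_attends_txt(query=img, key=txt, value=txt)
        return txt + txt_out, img + img_out     # residual connections

layer = CoAttentionLayer()
txt = torch.randn(2, 12, 768)                   # word-piece token features
img = torch.randn(2, 36, 768)                   # region features
txt, img = layer(txt, img)
print(txt.shape, img.shape)
```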
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations.
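Beyond those simple fusions, Multimodal Compact Bilinear (MCB) pooling approximates the full outer product of the visual and textual vectors by count-sketching each vector and convolving the sketches in the frequency domain. The sketch below is a minimal illustration of that idea; the output dimension, seeding, and toy inputs are assumptions for the example.

```python
import torch

def count_sketch(x, h, s, d):
    # x: (B, n); h: (n,) bin indices in [0, d); s: (n,) random signs in {-1, +1}
    sketch = x.new_zeros(x.size(0), d)
    sketch.index_add_(1, h, x * s)          # scatter signed features into d bins
    return sketch

def mcb_pool(v, q, d=16000, seed=0):
    """Compact bilinear pooling: count-sketch both vectors, convolve via FFT."""
    g = torch.Generator().manual_seed(seed)
    n_v, n_q = v.size(1), q.size(1)
    h_v = torch.randint(0, d, (n_v,), generator=g)
    h_q = torch.randint(0, d, (n_q,), generator=g)
    s_v = torch.randint(0, 2, (n_v,), generator=g).float() * 2 - 1
    s_q = torch.randint(0, 2, (n_q,), generator=g).float() * 2 - 1
    fft_v = torch.fft.rfft(count_sketch(v, h_v, s_v, d))
    fft_q = torch.fft.rfft(count_sketch(q, h_q, s_q, d))
    return torch.fft.irfft(fft_v * fft_q, n=d)   # (B, d) fused representation

# Toy usage: fuse a 2048-d visual vector with a 300-d text vector.
v = torch.randn(4, 2048)
q = torch.randn(4, 300)
print(mcb_pool(v, q).shape)   # torch.Size([4, 16000])
```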
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.
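The composition idea is that each modality keeps its own disentangled module while a shared universal module is reused across modalities for collaboration. The snippet below only illustrates that wiring; generic PyTorch transformer blocks stand in for the real modules, and all names and sizes are assumptions rather than the released architecture.

```python
import torch
import torch.nn as nn

dim = 512
# Modality-specific modules keep text and vision processing disentangled.
text_module = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
vision_module = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
# A single shared "universal" module is applied to both modalities.
universal_module = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

text_tokens = torch.randn(2, 16, dim)
image_patches = torch.randn(2, 49, dim)

text_feats = universal_module(text_module(text_tokens))
image_feats = universal_module(vision_module(image_patches))
print(text_feats.shape, image_feats.shape)
```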
Grounding of Textual Phrases in Images by Reconstruction
We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.
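The core loop is: attend over candidate regions with the phrase as the query, then try to reconstruct the phrase from the attended visual feature, so that the attention weights which make reconstruction easiest end up pointing at the right region. The sketch below illustrates that training signal with a simple one-word reconstruction head; the architecture details are simplifying assumptions, not the paper's decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttendAndReconstruct(nn.Module):
    def __init__(self, region_dim=2048, txt_dim=256, vocab_size=10000):
        super().__init__()
        self.proj = nn.Linear(region_dim, txt_dim)
        self.decoder = nn.Linear(region_dim, vocab_size)   # toy "reconstruction" head

    def forward(self, phrase_vec, region_feats):
        # Latent attention over regions, supervised only by reconstruction.
        scores = torch.einsum('bd,brd->br', phrase_vec, self.proj(region_feats))
        attn = scores.softmax(dim=-1)                      # (B, R)
        attended = torch.einsum('br,brd->bd', attn, region_feats)
        return attn, self.decoder(attended)                # word logits

model = AttendAndReconstruct()
phrase_vec = torch.randn(4, 256)              # encoded phrase (e.g. from an RNN)
region_feats = torch.randn(4, 36, 2048)       # proposal features
word_targets = torch.randint(0, 10000, (4,))  # one target word per phrase (toy)
attn, logits = model(phrase_vec, region_feats)
loss = F.cross_entropy(logits, word_targets)  # reconstruction loss drives attention
loss.backward()
```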
Revisiting Visual Question Answering Baselines
Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding.
Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat
We compare our approach to an alternative system which extends the baseline with reinforcement learning.
Word Discovery in Visually Grounded, Self-Supervised Speech Models
We present a method for visually-grounded spoken term discovery.
Dual Attention Networks for Visual Reference Resolution in Visual Dialog
Specifically, the REFER module learns latent relationships between a given question and the dialog history by employing a self-attention mechanism.
A Fast and Accurate One-Stage Approach to Visual Grounding
We propose a simple, fast, and accurate one-stage approach to visual grounding.
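In a one-stage formulation there is no separate proposal step: the query embedding is broadcast over a convolutional feature map, fused with it, and a box is predicted directly at every spatial location. The sketch below shows that fusion pattern in its simplest form; the backbone, grid size, and output parameterization are assumptions for the example, not the paper's exact design.

```python
import torch
import torch.nn as nn

class OneStageGrounder(nn.Module):
    def __init__(self, vis_dim=256, txt_dim=256):
        super().__init__()
        self.fuse = nn.Conv2d(vis_dim + txt_dim, 256, kernel_size=1)
        self.head = nn.Conv2d(256, 5, kernel_size=1)   # 4 box offsets + 1 confidence

    def forward(self, feat_map, query_vec):
        B, _, H, W = feat_map.shape
        tiled = query_vec[:, :, None, None].expand(B, -1, H, W)  # broadcast text
        fused = torch.relu(self.fuse(torch.cat([feat_map, tiled], dim=1)))
        return self.head(fused)                        # (B, 5, H, W) dense predictions

model = OneStageGrounder()
feat_map = torch.randn(1, 256, 16, 16)   # backbone features (e.g. from a CNN)
query_vec = torch.randn(1, 256)          # encoded query
pred = model(feat_map, query_vec)
conf = pred[:, 4]                        # confidence map over grid cells
cell = conf.flatten(1).argmax(dim=1)     # cell whose box is returned
print(pred.shape, cell)
```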
Learning Cross-modal Context Graph for Visual Grounding
To address the limitations of existing methods, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
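The idea is to build a graph over the phrases in the query (nodes are entities, edges come from their linguistic relations), propagate context along that graph, and then match the contextualized phrase nodes against region features. The sketch below shows one message-passing step plus a similarity-based matching; the adjacency construction, dimensions, and argmax matching are assumptions, not the paper's parser or matching objective.

```python
import torch
import torch.nn as nn

dim = 256
propagate = nn.Linear(dim, dim)   # one step of language-guided message passing

# Toy graph: 3 phrase nodes ("man", "holding", "umbrella") in a chain.
phrase_feats = torch.randn(3, dim)
adjacency = torch.tensor([[1., 1., 0.],
                          [1., 1., 1.],
                          [0., 1., 1.]])
adjacency = adjacency / adjacency.sum(dim=1, keepdim=True)   # row-normalize

# Aggregate neighbor information to get context-aware phrase nodes.
context_nodes = torch.relu(propagate(adjacency @ phrase_feats)) + phrase_feats

# Cross-modal matching: score every phrase node against every region feature.
region_feats = torch.randn(36, dim)                          # projected regions
similarity = context_nodes @ region_feats.t()                # (3, 36)
matched_region = similarity.argmax(dim=1)                    # one region per phrase
print(matched_region)
```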