Visual Grounding

291 papers with code • 4 benchmarks • 11 datasets

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a minimal query-to-box sketch follows this list):

  • What is the main focus of the query?
  • How should the image content be understood?
  • How is the referred object localized?
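
Below is a minimal, hedged sketch of the task's input/output contract, using the Hugging Face zero-shot-object-detection pipeline with OWL-ViT as a stand-in model; dedicated grounding models (e.g. MDETR or OFA, listed below) handle richer referring expressions. The image path and query are purely illustrative.

```python
# Query-based localization: natural language in, bounding box out.
from transformers import pipeline
from PIL import Image

detector = pipeline("zero-shot-object-detection",
                    model="google/owlvit-base-patch32")

image = Image.open("street.jpg")            # any RGB image (hypothetical path)
query = "the woman holding a red umbrella"  # natural language query

# The pipeline scores the phrase against image regions and returns boxes.
for r in detector(image, candidate_labels=[query]):
    print(r["label"], round(r["score"], 3), r["box"])  # box: xmin/ymin/xmax/ymax
```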

Most implemented papers

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

facebookresearch/vilbert-multi-task NeurIPS 2019

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
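
The core mechanism behind this two-stream design is co-attention between the visual and textual streams. The snippet below is a rough sketch of that idea under assumed shapes and layer choices, not ViLBERT's released implementation (which adds feed-forward blocks, layer norm, and BERT-style pretraining objectives).

```python
# Each stream attends to the *other* modality's features.
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.vis_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        # queries come from one stream, keys/values from the other
        vis_out, _ = self.vis_attends_txt(vis, txt, txt)
        txt_out, _ = self.txt_attends_vis(txt, vis, vis)
        return vis + vis_out, txt + txt_out

layer = CoAttentionLayer()
regions = torch.randn(1, 36, 768)   # e.g. 36 detected region features
tokens = torch.randn(1, 20, 768)    # e.g. 20 wordpiece embeddings
regions, tokens = layer(regions, tokens)
```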

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

akirafukui/vqa-mcb EMNLP 2016

Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations; this work instead approximates the more expressive outer product of the two vectors with Multimodal Compact Bilinear (MCB) pooling.
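
A condensed sketch of the compact bilinear idea follows: count-sketch each feature vector into a shared space, multiply in the FFT domain (a circular convolution), and invert. The output dimension, feature sizes, and per-call random projections below are illustrative assumptions, not the authors' released code; in practice the projection indices and signs are fixed buffers reused across batches.

```python
import torch

def count_sketch(x, h, s, out_dim):
    # x: (batch, in_dim); h: random target indices; s: random +/-1 signs
    sketch = x.new_zeros(x.size(0), out_dim)
    sketch.index_add_(1, h, x * s)
    return sketch

def mcb_pool(v, q, out_dim=16000, seed=0):
    g = torch.Generator().manual_seed(seed)
    h_v = torch.randint(0, out_dim, (v.size(1),), generator=g)
    s_v = torch.randint(0, 2, (v.size(1),), generator=g).float() * 2 - 1
    h_q = torch.randint(0, out_dim, (q.size(1),), generator=g)
    s_q = torch.randint(0, 2, (q.size(1),), generator=g).float() * 2 - 1
    fv = torch.fft.rfft(count_sketch(v, h_v, s_v, out_dim))
    fq = torch.fft.rfft(count_sketch(q, h_q, s_q, out_dim))
    # element-wise product in the frequency domain ~ circular convolution
    return torch.fft.irfft(fv * fq, n=out_dim)

visual = torch.randn(2, 2048)      # e.g. pooled CNN features
textual = torch.randn(2, 1024)     # e.g. LSTM question encoding
fused = mcb_pool(visual, textual)  # (2, 16000)
```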

ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance

ivan-tang-3d/viewrefer3d 29 Mar 2023

In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding exploring how to grasp the view knowledge from both text and 3D modalities.

MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

ashkamath/mdetr 26 Apr 2021

We propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, such as a caption or a question. We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

ofa-sys/ofa 7 Feb 2022

In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.
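
One way such a unified sequence-to-sequence framework handles grounding is to emit the bounding box as a short sequence of discretized location tokens, so region output needs no task-specific head. The bin count, token spelling, and helper functions below are assumptions for illustration, not OFA's exact vocabulary or code.

```python
# Boxes become text: quantize coordinates into location tokens and back.
NUM_BINS = 1000

def box_to_tokens(box, img_w, img_h, num_bins=NUM_BINS):
    """Quantize (x0, y0, x1, y1) pixel coordinates into location-token strings."""
    x0, y0, x1, y1 = box
    norm = [x0 / img_w, y0 / img_h, x1 / img_w, y1 / img_h]
    bins = [min(int(c * num_bins), num_bins - 1) for c in norm]
    return [f"<bin_{b}>" for b in bins]

def tokens_to_box(tokens, img_w, img_h, num_bins=NUM_BINS):
    """Invert the quantization back to pixel coordinates (bin centers)."""
    bins = [int(t.strip("<bin_>")) for t in tokens]
    centers = [(b + 0.5) / num_bins for b in bins]
    return (centers[0] * img_w, centers[1] * img_h,
            centers[2] * img_w, centers[3] * img_h)

tokens = box_to_tokens((120, 40, 380, 300), img_w=640, img_h=480)
print(tokens)                          # e.g. ['<bin_187>', '<bin_83>', ...]
print(tokens_to_box(tokens, 640, 480)) # approximate original box
```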

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

alibaba/AliceMind 1 Feb 2023

In contrast to predominant paradigms that rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network that shares common universal modules for modality collaboration while disentangling modality-specific modules to address modality entanglement.
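
A loose sketch of that composition idea, with stand-in encoders and sizes that are assumptions rather than mPLUG-2's actual architecture: each modality keeps its own (disentangled) encoder, while one shared fusion module is reused across text-image and text-video inputs.

```python
import torch
import torch.nn as nn

class ModularMultimodalModel(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # modality-specific modules (disentangled)
        self.text_enc = nn.Linear(300, dim)    # stand-in for a text encoder
        self.image_enc = nn.Linear(2048, dim)  # stand-in for a vision encoder
        self.video_enc = nn.Linear(1024, dim)  # stand-in for a video encoder
        # universal module shared by every modality pairing
        self.shared_fusion = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, text, visual, modality="image"):
        enc = self.image_enc if modality == "image" else self.video_enc
        seq = torch.cat([self.text_enc(text), enc(visual)], dim=1)
        return self.shared_fusion(seq)

text = torch.randn(2, 12, 300)    # token embeddings (illustrative)
image = torch.randn(2, 49, 2048)  # patch/region features (illustrative)
fused = ModularMultimodalModel()(text, image, modality="image")  # (2, 61, 512)
```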

Towards Visual Grounding: A Survey

linhuixiao/awesome-visual-grounding 28 Dec 2024

Finally, we outline the challenges confronting visual grounding and propose valuable directions for future research, which may serve as inspiration for subsequent researchers.

Grounding of Textual Phrases in Images by Reconstruction

akirafukui/vqa-mcb 12 Nov 2015

We propose a novel approach that learns grounding by reconstructing a given phrase using an attention mechanism, where the attention can be either latent or optimized directly.
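
The sketch below condenses that grounding-by-reconstruction training signal under assumed feature sizes and a simplified decoder; it is illustrative, not the authors' released code. A phrase attends over region proposals, the attended visual feature must reconstruct the phrase, and the reconstruction loss supervises the attention even without bounding-box annotations.

```python
import torch
import torch.nn as nn

class GroundByReconstruction(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=512, hidden=512, vocab=10000):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.att_score = nn.Linear(hidden, 1)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.word_out = nn.Linear(hidden, vocab)

    def forward(self, regions, phrase_vec, phrase_len):
        # attention over region proposals conditioned on the phrase
        joint = torch.tanh(self.vis_proj(regions) + self.txt_proj(phrase_vec).unsqueeze(1))
        alpha = torch.softmax(self.att_score(joint).squeeze(-1), dim=1)        # (B, R)
        attended = (alpha.unsqueeze(-1) * self.vis_proj(regions)).sum(dim=1)   # (B, H)
        # reconstruct the phrase from the attended visual feature
        dec_in = attended.unsqueeze(1).expand(-1, phrase_len, -1)
        out, _ = self.decoder(dec_in)
        return self.word_out(out), alpha  # word logits for the reconstruction loss

model = GroundByReconstruction()
regions = torch.randn(4, 50, 2048)  # 50 proposal features per image (illustrative)
phrase = torch.randn(4, 512)        # encoded phrase (e.g. LSTM final state)
logits, alpha = model(regions, phrase, phrase_len=6)
# Cross-entropy between `logits` and the phrase tokens trains `alpha`;
# at test time, the argmax over `alpha` picks the grounded region.
```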

Revisiting Visual Question Answering Baselines

Cold-Winter/vqs 27 Jun 2016

Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding.

Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat

shekharRavi/Beyond-Task-Success-NAACL2019 NAACL 2019

We compare our approach to an alternative system which extends the baseline with reinforcement learning.