Visual Grounding

195 papers with code • 3 benchmarks • 7 datasets

Visual Grounding (VG) aims to locate the object or region in an image that is most relevant to a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

  • What is the main focus in a query?
  • How to understand an image?
  • How to locate an object?
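
As a rough illustration of how the three challenges fit together, the sketch below treats grounding as picking one candidate region for a query. It assumes the query and each candidate region have already been encoded as vectors in a shared space, and that candidates are scored by cosine similarity; the function names, the toy embeddings, and the boxes are all hypothetical and do not come from any of the papers listed on this page.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), hypothetical format


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ground_query(
    query_embedding: Sequence[float],
    region_embeddings: List[Sequence[float]],
    region_boxes: List[Box],
) -> Tuple[Box, float]:
    """Return the candidate box whose embedding best matches the query, with its score."""
    if not region_embeddings or len(region_embeddings) != len(region_boxes):
        raise ValueError("need one box per region embedding, and at least one region")
    scores = [cosine_similarity(query_embedding, r) for r in region_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return region_boxes[best], scores[best]


# Toy, hand-written embeddings (assumption: a real system would use learned encoders).
query = [1.0, 0.0]
regions = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
boxes = [(0.0, 0.0, 10.0, 10.0), (20.0, 30.0, 60.0, 90.0), (5.0, 5.0, 15.0, 15.0)]

best_box, best_score = ground_query(query, regions, boxes)
print(best_box, round(best_score, 3))
```

Here the query embedding stands in for "the main focus in a query", the region embeddings for "understanding the image", and the final argmax for "locating the object". Real systems may instead regress box coordinates directly rather than rank a fixed candidate set.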

Most implemented papers

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

facebookresearch/vilbert-multi-task NeurIPS 2019

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

akirafukui/vqa-mcb EMNLP 2016

Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations.
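
The three baseline pooling operations named above can be sketched on plain Python lists. This is only a minimal illustration of element-wise product, element-wise sum, and concatenation; the function names and the toy feature vectors are assumptions, and the sketch does not implement the compact bilinear pooling that the paper itself proposes.

```python
from typing import List

Vector = List[float]


def _check_same_length(visual: Vector, textual: Vector) -> None:
    """Element-wise operations require both feature vectors to have the same size."""
    if len(visual) != len(textual):
        raise ValueError("visual and textual features must have the same length")


def elementwise_product(visual: Vector, textual: Vector) -> Vector:
    """Fuse by multiplying matching dimensions; output has the same length as the inputs."""
    _check_same_length(visual, textual)
    return [v * t for v, t in zip(visual, textual)]


def elementwise_sum(visual: Vector, textual: Vector) -> Vector:
    """Fuse by adding matching dimensions; output has the same length as the inputs."""
    _check_same_length(visual, textual)
    return [v + t for v, t in zip(visual, textual)]


def concatenate(visual: Vector, textual: Vector) -> Vector:
    """Fuse by stacking the two vectors end to end; lengths may differ."""
    return list(visual) + list(textual)


# Toy feature vectors (hypothetical values, not taken from any model).
visual_features = [1.0, 2.0, 3.0]
textual_features = [0.5, -1.0, 2.0]

product = elementwise_product(visual_features, textual_features)
summed = elementwise_sum(visual_features, textual_features)
joined = concatenate(visual_features, textual_features)

print(product)  # [0.5, -2.0, 6.0]
print(summed)   # [1.5, 1.0, 5.0]
print(joined)   # [1.0, 2.0, 3.0, 0.5, -1.0, 2.0]
```

Product and sum keep the feature dimension fixed but only combine matching dimensions, while concatenation doubles the dimension and leaves any interaction between the two modalities to later layers.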

ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance

ivan-tang-3d/viewrefer3d 29 Mar 2023

In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding exploring how to grasp the view knowledge from both text and 3D modalities.

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

ofa-sys/ofa 7 Feb 2022

In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

alibaba/AliceMind 1 Feb 2023

In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.

Grounding of Textual Phrases in Images by Reconstruction

akirafukui/vqa-mcb 12 Nov 2015

We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.

Revisiting Visual Question Answering Baselines

Cold-Winter/vqs 27 Jun 2016

Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding.

Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat

shekharRavi/Beyond-Task-Success-NAACL2019 NAACL 2019

We compare our approach to an alternative system which extends the baseline with reinforcement learning.

MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

ashkamath/mdetr 26 Apr 2021

We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.

Word Discovery in Visually Grounded, Self-Supervised Speech Models

kamperh/vqwordseg 28 Mar 2022

We present a method for visually-grounded spoken term discovery.