Visual Grounding (VG) aims to locate the object or region in an image that is most relevant to a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

  • What is the main focus in a query?
  • How to understand an image?
  • How to locate an object?
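
The three questions above can be read as three stages: encode the query, encode the image as candidate regions, and pick the region that best matches the query. The following is a minimal sketch of the last stage only, assuming that query and region embeddings have already been computed by some encoder and live in the same vector space. The names `cosine` and `ground` and the `(box, feature)` layout are hypothetical, chosen for illustration; real systems learn the matching function rather than using plain cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def ground(query_vec, regions):
    """Return (box, score) for the candidate region most similar to the query.

    `regions` is a list of (box, feature) pairs; `box` is any box description,
    here an (x1, y1, x2, y2) tuple. This layout is an assumption of the sketch.
    """
    if not regions:
        raise ValueError("at least one candidate region is required")
    scored = [(box, cosine(query_vec, feat)) for box, feat in regions]
    return max(scored, key=lambda pair: pair[1])

# Toy usage: two candidate boxes with made-up 2-d features.
candidates = [
    ((0, 0, 10, 10), [1.0, 0.0]),
    ((5, 5, 20, 20), [0.0, 1.0]),
]
best_box, best_score = ground([0.1, 0.9], candidates)
```

In this toy setup the second box is selected because its feature points in nearly the same direction as the query vector.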


ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

facebookresearch/vilbert-multi-task NeurIPS 2019

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

akirafukui/vqa-mcb EMNLP 2016

Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations.
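
To make the three baseline pooling operations named above concrete, here is a small sketch on plain Python lists. It shows only element-wise product, element-wise sum, and concatenation; it does not implement compact bilinear pooling itself. The function names are hypothetical, and the sketch assumes the visual and textual vectors have already been projected to the same length where the operation requires it.

```python
def _check_same_length(visual, textual):
    # Product and sum are only defined for vectors of equal length.
    if len(visual) != len(textual):
        raise ValueError("visual and textual vectors must have the same length")

def elementwise_product(visual, textual):
    """Fuse by multiplying matching dimensions; output has the same length."""
    _check_same_length(visual, textual)
    return [v * t for v, t in zip(visual, textual)]

def elementwise_sum(visual, textual):
    """Fuse by adding matching dimensions; output has the same length."""
    _check_same_length(visual, textual)
    return [v + t for v, t in zip(visual, textual)]

def concatenate(visual, textual):
    """Fuse by stacking the two vectors; output length is the sum of both."""
    return list(visual) + list(textual)

# Toy usage with made-up 2-d features.
visual_feat = [1.0, 2.0]
textual_feat = [3.0, 4.0]
fused_product = elementwise_product(visual_feat, textual_feat)
fused_sum = elementwise_sum(visual_feat, textual_feat)
fused_concat = concatenate(visual_feat, textual_feat)
```

Product and sum keep the dimensionality fixed but only combine matching dimensions, while concatenation keeps all information and leaves the interaction between modalities to later layers.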

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

alibaba/AliceMind 1 Feb 2023

In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.

Grounding of Textual Phrases in Images by Reconstruction

akirafukui/vqa-mcb 12 Nov 2015

We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.

Revisiting Visual Question Answering Baselines

Cold-Winter/vqs 27 Jun 2016

Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding.

Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat

shekharRavi/Beyond-Task-Success-NAACL2019 NAACL 2019

We compare our approach to an alternative system which extends the baseline with reinforcement learning.

Word Discovery in Visually Grounded, Self-Supervised Speech Models

kamperh/vqwordseg 28 Mar 2022

We present a method for visually-grounded spoken term discovery.

Dual Attention Networks for Visual Reference Resolution in Visual Dialog

gicheonkang/DAN-VisDial IJCNLP 2019

Specifically, the REFER module learns latent relationships between a given question and a dialog history by employing a self-attention mechanism.
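
As a rough illustration of how attention can relate a question to a dialog history, the sketch below computes single-head scaled dot-product attention of one question vector over a list of history-turn vectors. It is a simplified stand-in, not the REFER module: the function names, the single head, and the absence of learned projections are all assumptions of this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(question, history):
    """Attend from one question vector over dialog-history turn vectors.

    Returns (weights, context): one weight per turn, and the weighted
    average of the turn vectors. All vectors must share one length.
    """
    if not history:
        raise ValueError("history must contain at least one turn")
    dim = len(question)
    scale = math.sqrt(dim)
    scores = [sum(q * h for q, h in zip(question, turn)) / scale for turn in history]
    weights = softmax(scores)
    context = [sum(w * turn[i] for w, turn in zip(weights, history)) for i in range(dim)]
    return weights, context

# Toy usage: the first history turn is aligned with the question.
weights, context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The turn whose vector is most aligned with the question receives the largest weight, which is the sense in which attention can pick out the part of the history a question refers to.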

A Fast and Accurate One-Stage Approach to Visual Grounding

zyang-ur/onestage_grounding ICCV 2019

We propose a simple, fast, and accurate one-stage approach to visual grounding, inspired by the following insight.

Learning Cross-modal Context Graph for Visual Grounding

youngfly11/LCMCG-PyTorch 20 Nov 2019

To address their limitations, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.