Visual Grounding

174 papers with code • 3 benchmarks • 5 datasets

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a code sketch of the basic task interface follows this list):

  • What is the main focus of a query?
  • How to understand an image?
  • How to locate an object?
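At the interface level, a grounding model maps an (image, query) pair to a scored box. Below is a minimal sketch of that interface using OWL-ViT from Hugging Face `transformers` as one publicly available query-conditioned localizer; the checkpoint name, zero threshold, and single-box selection are illustrative choices, not part of any paper listed below.

```python
# A minimal sketch of the VG interface: (image, query) -> best-scoring box.
# OWL-ViT is used here only as one publicly available query-conditioned
# localizer; the checkpoint name and single-box selection are illustrative.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def ground(image: Image.Image, query: str):
    # Jointly encode the image and the text query.
    inputs = processor(text=[[query]], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Map predictions back to pixel coordinates (target size is height, width).
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_object_detection(
        outputs, threshold=0.0, target_sizes=target_sizes)[0]
    # Keep only the single highest-scoring box for this query.
    best = results["scores"].argmax()
    return results["boxes"][best].tolist(), results["scores"][best].item()
```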

Most implemented papers

SeqTR: A Simple yet Universal Network for Visual Grounding

sean-zhuh/seqtr 30 Mar 2022

In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES).
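SeqTR's core idea is to serialize grounding outputs (a box for REC, contour points for RES) into a sequence of quantized coordinate tokens that a transformer decoder predicts autoregressively. The toy helpers below sketch only the coordinate quantization; `NUM_BINS` and the helper names are assumptions for illustration, not the authors' code.

```python
# Toy reconstruction of SeqTR's output serialization: a box becomes four
# quantized coordinate tokens that a decoder would predict one at a time.
NUM_BINS = 1000  # coordinate quantization resolution (an assumed value)

def box_to_tokens(box, img_w, img_h, num_bins=NUM_BINS):
    """Quantize (x1, y1, x2, y2) pixel coordinates into discrete tokens."""
    x1, y1, x2, y2 = box
    q = lambda v, size: min(num_bins - 1, int(v / size * num_bins))
    return [q(x1, img_w), q(y1, img_h), q(x2, img_w), q(y2, img_h)]

def tokens_to_box(tokens, img_w, img_h, num_bins=NUM_BINS):
    """Invert the quantization back to pixel coordinates."""
    dq = lambda t, size: (t + 0.5) / num_bins * size
    tx1, ty1, tx2, ty2 = tokens
    return (dq(tx1, img_w), dq(ty1, img_h), dq(tx2, img_w), dq(ty2, img_h))

# Round trip; a mask contour for RES would be a longer sequence of such tokens.
tokens = box_to_tokens((48.0, 30.0, 200.0, 180.0), img_w=640, img_h=480)
print(tokens, tokens_to_box(tokens, 640, 480))
```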

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

alibaba/AliceMind 24 May 2022

Large-scale pretrained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks.

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

microsoft/SoM 17 Oct 2023

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V.
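SoM works by overlaying visually distinct marks (e.g., numeric IDs) on candidate regions so the LMM can answer in terms of mark IDs rather than raw coordinates. Here is a minimal sketch of that prompting pattern, assuming boxes come from some off-the-shelf proposal or segmentation stage; the drawing code is generic PIL, and the prompt wording is an illustrative guess rather than the paper's exact prompt.

```python
# Sketch of Set-of-Mark-style prompting: number candidate regions on the
# image, then ask the LMM to answer with a mark ID.
from PIL import Image, ImageDraw

def mark_regions(image: Image.Image, boxes):
    """Overlay a numeric mark on each candidate region."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for i, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        draw.rectangle((x1, y1, x2, y2), outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), str(i), fill="red")
    return marked

def som_prompt(query: str, num_marks: int) -> str:
    # Illustrative prompt wording, not the exact prompt from the SoM paper.
    return (f"The image shows {num_marks} regions labeled 1 to {num_marks}. "
            f"Which label best matches: '{query}'? Answer with the number only.")
```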

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

qizekun/ShapeLLM 27 Feb 2024

This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages.

Dual Attention Networks for Visual Reference Resolution in Visual Dialog

gicheonkang/DAN-VisDial IJCNLP 2019

Specifically, the REFER module learns latent relationships between a given question and the dialog history by employing a self-attention mechanism.
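The described pattern, where the current question attends over encoded dialog-history turns to resolve references, can be sketched with standard PyTorch attention. The use of `nn.MultiheadAttention` and all sizes below are assumptions for illustration, not the authors' implementation.

```python
# Sketch of the REFER-style attention pattern: the question attends over
# encoded dialog-history turns to produce a history-aware representation.
import torch
import torch.nn as nn

d_model = 512
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

question = torch.randn(1, 1, d_model)   # one encoded question
history = torch.randn(1, 10, d_model)   # ten encoded dialog-history turns

# Question as query, history as keys/values; the attention weights indicate
# which past turns the refined question representation relied on.
refined, weights = attn(question, history, history)
print(refined.shape, weights.shape)  # (1, 1, 512) (1, 1, 10)
```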

A Fast and Accurate One-Stage Approach to Visual Grounding

zyang-ur/onestage_grounding ICCV 2019

We propose a simple, fast, and accurate one-stage approach to visual grounding, motivated by the observation that the performance of two-stage propose-and-rank methods is capped by the quality of their region candidates.
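The fusion pattern this line of work is known for is broadcasting the sentence embedding over a one-stage detector's spatial feature map and predicting a box at every location, keeping the top-scoring one. Below is a toy sketch under that assumption; the channel sizes and 1x1 prediction head are illustrative, while the paper's actual model builds on a YOLOv3 backbone.

```python
# Toy sketch of one-stage fusion: broadcast the sentence embedding over the
# detector's spatial feature map and predict a box at every cell.
import torch
import torch.nn as nn

visual = torch.randn(1, 256, 32, 32)   # backbone feature map
text = torch.randn(1, 512)             # sentence embedding

text_map = text[:, :, None, None].expand(-1, -1, 32, 32)
fused = torch.cat([visual, text_map], dim=1)      # (1, 768, 32, 32)

head = nn.Conv2d(768, 5, kernel_size=1)           # 1 confidence + 4 box params
pred = head(fused)                                # dense per-cell predictions
best_cell = pred[:, 0].flatten(1).argmax(dim=1)   # highest-confidence location
```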

Learning Cross-modal Context Graph for Visual Grounding

youngfly11/LCMCG-PyTorch 20 Nov 2019

To address the limitations of prior methods, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
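The matching step the abstract mentions can be sketched as scoring similarities between phrase-node embeddings and region-node embeddings. This node-only toy version is a simplification for illustration; the paper's method also models relations (graph edges), which the sketch omits.

```python
# Toy sketch of cross-modal matching: score phrase nodes against region nodes.
import torch
import torch.nn.functional as F

phrases = torch.randn(3, 256)    # e.g. nodes for "man", "dog", "leash"
regions = torch.randn(20, 256)   # detected region embeddings

# Cosine similarity between every phrase node and every region node.
sim = F.normalize(phrases, dim=1) @ F.normalize(regions, dim=1).T  # (3, 20)

# Greedy assignment: each phrase takes its best-matching region.
assignment = sim.argmax(dim=1)
```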

Composing Pick-and-Place Tasks By Grounding Language

mees/AIS-Alexa-Robot 16 Feb 2021

Controlling robots to perform tasks via natural language is one of the most challenging topics in human-robot interaction.

TransVG: End-to-End Visual Grounding with Transformers

djiajunustc/TransVG ICCV 2021

In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image.
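TransVG's fusion stage, as the paper describes it, concatenates visual tokens, linguistic tokens, and a learnable [REG] token, encodes them jointly with a transformer, and regresses the box from the [REG] output. Below is a minimal sketch; the token counts, dimensions, layer sizes, and sigmoid box parametrization are illustrative.

```python
# Sketch of TransVG-style fusion: concatenate a learnable [REG] token with
# visual and linguistic tokens, encode jointly, regress the box from [REG].
import torch
import torch.nn as nn

d = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
    num_layers=6)
reg_token = nn.Parameter(torch.zeros(1, 1, d))
box_head = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 4))

visual_tokens = torch.randn(1, 400, d)  # e.g. a flattened 20x20 feature map
text_tokens = torch.randn(1, 20, d)     # e.g. projected BERT outputs

seq = torch.cat([reg_token, visual_tokens, text_tokens], dim=1)
out = encoder(seq)
box = box_head(out[:, 0]).sigmoid()     # normalized (cx, cy, w, h)
```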