Visual Grounding

174 papers with code • 3 benchmarks • 5 datasets

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a code sketch of the basic task interface follows this list):

  • What is the main focus of a query?
  • How to understand an image?
  • How to locate an object?
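At the interface level, a grounding model maps an (image, query) pair to a scored box. Below is a minimal sketch of that interface using OWL-ViT from Hugging Face `transformers` as one publicly available query-conditioned localizer; the checkpoint name, zero threshold, and single-box selection are illustrative choices, not part of any paper listed below.

```python
# A minimal sketch of the VG interface: (image, query) -> best-scoring box.
# OWL-ViT is used here only as one publicly available query-conditioned
# localizer; the checkpoint name and single-box selection are illustrative.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def ground(image: Image.Image, query: str):
    # Jointly encode the image and the text query.
    inputs = processor(text=[[query]], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Map predictions back to pixel coordinates (target size is height, width).
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_object_detection(
        outputs, threshold=0.0, target_sizes=target_sizes)[0]
    # Keep only the single highest-scoring box for this query.
    best = results["scores"].argmax()
    return results["boxes"][best].tolist(), results["scores"][best].item()
```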

Most implemented papers

SeqTR: A Simple yet Universal Network for Visual Grounding

sean-zhuh/seqtr 30 Mar 2022

In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES).
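SeqTR's core idea is to serialize grounding outputs (a box for REC, contour points for RES) into a sequence of quantized coordinate tokens that a transformer decoder predicts autoregressively. The toy helpers below sketch only the coordinate quantization; `NUM_BINS` and the helper names are assumptions for illustration, not the authors' code.

```python
# Toy reconstruction of SeqTR's output serialization: a box becomes four
# quantized coordinate tokens that a decoder would predict one at a time.
NUM_BINS = 1000  # coordinate quantization resolution (an assumed value)

def box_to_tokens(box, img_w, img_h, num_bins=NUM_BINS):
    """Quantize (x1, y1, x2, y2) pixel coordinates into discrete tokens."""
    x1, y1, x2, y2 = box
    q = lambda v, size: min(num_bins - 1, int(v / size * num_bins))
    return [q(x1, img_w), q(y1, img_h), q(x2, img_w), q(y2, img_h)]

def tokens_to_box(tokens, img_w, img_h, num_bins=NUM_BINS):
    """Invert the quantization back to pixel coordinates."""
    dq = lambda t, size: (t + 0.5) / num_bins * size
    tx1, ty1, tx2, ty2 = tokens
    return (dq(tx1, img_w), dq(ty1, img_h), dq(tx2, img_w), dq(ty2, img_h))

# Round trip; a mask contour for RES would be a longer sequence of such tokens.
tokens = box_to_tokens((48.0, 30.0, 200.0, 180.0), img_w=640, img_h=480)
print(tokens, tokens_to_box(tokens, 640, 480))
```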

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

alibaba/AliceMind 24 May 2022

Large-scale pretrained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks.

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

microsoft/SoM 17 Oct 2023

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V.
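SoM works by overlaying visually distinct marks (e.g., numeric IDs) on candidate regions so the LMM can answer in terms of mark IDs rather than raw coordinates. Here is a minimal sketch of that prompting pattern, assuming boxes come from some off-the-shelf proposal or segmentation stage; the drawing code is generic PIL, and the prompt wording is an illustrative guess rather than the paper's exact prompt.

```python
# Sketch of Set-of-Mark-style prompting: number candidate regions on the
# image, then ask the LMM to answer with a mark ID.
from PIL import Image, ImageDraw

def mark_regions(image: Image.Image, boxes):
    """Overlay a numeric mark on each candidate region."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for i, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        draw.rectangle((x1, y1, x2, y2), outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), str(i), fill="red")
    return marked

def som_prompt(query: str, num_marks: int) -> str:
    # Illustrative prompt wording, not the exact prompt from the SoM paper.
    return (f"The image shows {num_marks} regions labeled 1 to {num_marks}. "
            f"Which label best matches: '{query}'? Answer with the number only.")
```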

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

qizekun/ShapeLLM 27 Feb 2024

This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages.

Dual Attention Networks for Visual Reference Resolution in Visual Dialog

gicheonkang/DAN-VisDial IJCNLP 2019

Specifically, the REFER module learns latent relationships between a given question and the dialog history by employing a self-attention mechanism.
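The described pattern, where the current question attends over encoded dialog-history turns to resolve references, can be sketched with standard PyTorch attention. The use of `nn.MultiheadAttention` and all sizes below are assumptions for illustration, not the authors' implementation.

```python
# Sketch of the REFER-style attention pattern: the question attends over
# encoded dialog-history turns to produce a history-aware representation.
import torch
import torch.nn as nn

d_model = 512
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

question = torch.randn(1, 1, d_model)   # one encoded question
history = torch.randn(1, 10, d_model)   # ten encoded dialog-history turns

# Question as query, history as keys/values; the attention weights indicate
# which past turns the refined question representation relied on.
refined, weights = attn(question, history, history)
print(refined.shape, weights.shape)  # (1, 1, 512) (1, 1, 10)
```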

A Fast and Accurate One-Stage Approach to Visual Grounding

zyang-ur/onestage_grounding ICCV 2019

We propose a simple, fast, and accurate one-stage approach to visual grounding, motivated by the observation that the performance of two-stage propose-and-rank methods is capped by the quality of their region candidates.
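The fusion pattern this line of work is known for is broadcasting the sentence embedding over a one-stage detector's spatial feature map and predicting a box at every location, keeping the top-scoring one. Below is a toy sketch under that assumption; the channel sizes and 1x1 prediction head are illustrative, while the paper's actual model builds on a YOLOv3 backbone.

```python
# Toy sketch of one-stage fusion: broadcast the sentence embedding over the
# detector's spatial feature map and predict a box at every cell.
import torch
import torch.nn as nn

visual = torch.randn(1, 256, 32, 32)   # backbone feature map
text = torch.randn(1, 512)             # sentence embedding

text_map = text[:, :, None, None].expand(-1, -1, 32, 32)
fused = torch.cat([visual, text_map], dim=1)      # (1, 768, 32, 32)

head = nn.Conv2d(768, 5, kernel_size=1)           # 1 confidence + 4 box params
pred = head(fused)                                # dense per-cell predictions
best_cell = pred[:, 0].flatten(1).argmax(dim=1)   # highest-confidence location
```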

Learning Cross-modal Context Graph for Visual Grounding

youngfly11/LCMCG-PyTorch 20 Nov 2019

To address the limitations of prior methods, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
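The matching step the abstract mentions can be sketched as scoring similarities between phrase-node embeddings and region-node embeddings. This node-only toy version is a simplification for illustration; the paper's method also models relations (graph edges), which the sketch omits.

```python
# Toy sketch of cross-modal matching: score phrase nodes against region nodes.
import torch
import torch.nn.functional as F

phrases = torch.randn(3, 256)    # e.g. nodes for "man", "dog", "leash"
regions = torch.randn(20, 256)   # detected region embeddings

# Cosine similarity between every phrase node and every region node.
sim = F.normalize(phrases, dim=1) @ F.normalize(regions, dim=1).T  # (3, 20)

# Greedy assignment: each phrase takes its best-matching region.
assignment = sim.argmax(dim=1)
```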

Composing Pick-and-Place Tasks By Grounding Language

mees/AIS-Alexa-Robot 16 Feb 2021

Controlling robots to perform tasks via natural language is one of the most challenging topics in human-robot interaction.

TransVG: End-to-End Visual Grounding with Transformers

djiajunustc/TransVG ICCV 2021

In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image.
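TransVG's fusion stage, as the paper describes it, concatenates visual tokens, linguistic tokens, and a learnable [REG] token, encodes them jointly with a transformer, and regresses the box from the [REG] output. Below is a minimal sketch; the token counts, dimensions, layer sizes, and sigmoid box parametrization are illustrative.

```python
# Sketch of TransVG-style fusion: concatenate a learnable [REG] token with
# visual and linguistic tokens, encode jointly, regress the box from [REG].
import torch
import torch.nn as nn

d = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
    num_layers=6)
reg_token = nn.Parameter(torch.zeros(1, 1, d))
box_head = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 4))

visual_tokens = torch.randn(1, 400, d)  # e.g. a flattened 20x20 feature map
text_tokens = torch.randn(1, 20, d)     # e.g. projected BERT outputs

seq = torch.cat([reg_token, visual_tokens, text_tokens], dim=1)
out = encoder(seq)
box = box_head(out[:, 0]).sigmoid()     # normalized (cx, cy, w, h)
```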