Visual Grounding
173 papers with code • 3 benchmarks • 5 datasets
Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a minimal inference sketch follows the list):
- What is the main focus of the query?
- How to understand the image?
- How to locate the target object?
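These steps are typically handled end to end by a pretrained image–text model. As a concrete illustration, here is a minimal sketch of query-conditioned localization with OWL-ViT through the Hugging Face transformers API; the checkpoint name, example image URL, query string, and score threshold are illustrative assumptions, not taken from any paper listed below.

```python
# Minimal visual grounding sketch: given an image and a language query,
# return candidate boxes for the referred object (assumptions noted above).
import requests
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)
queries = [["a cat lying on the left"]]  # natural language query (phrase setting)

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to boxes in pixel coordinates and keep confident matches.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for box, score in zip(results["boxes"], results["scores"]):
    print([round(v, 1) for v in box.tolist()], round(score.item(), 3))
```

For REC-style evaluation, typically only the highest-scoring box per query is kept and compared to the ground-truth box at an IoU threshold (e.g. 0.5).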
Most implemented papers
SeqTR: A Simple yet Universal Network for Visual Grounding
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC), and referring expression segmentation (RES).
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Large-scale pretrained foundation models have emerged as a paradigm for building artificial intelligence (AI) systems that can be quickly adapted to a wide range of downstream tasks.
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance
In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding that explores how to grasp view knowledge from both the text and 3D modalities.
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V.
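The core mechanism is easy to reproduce: overlay numbered, visually distinct marks on image regions so the model can ground its answer by citing a mark index instead of emitting raw coordinates. Below is a toy sketch of that overlay step with Pillow; SoM itself derives the regions from a segmentation model such as SAM, so the hard-coded boxes, colors, and output file name here are placeholder assumptions.

```python
# Toy Set-of-Mark-style overlay: draw a numbered mark on each region so an
# LMM can answer "mark 2" instead of predicting pixel coordinates.
from PIL import Image, ImageDraw

def draw_marks(image, regions):
    """Draw an index label at the center of each (x0, y0, x1, y1) region."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x0, y0, x1, y1) in enumerate(regions, start=1):
        cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
        draw.rectangle([x0, y0, x1, y1], outline="red", width=2)          # region outline
        draw.rectangle([cx - 10, cy - 10, cx + 10, cy + 10], fill="red")  # mark badge
        draw.text((cx - 4, cy - 7), str(idx), fill="white")               # mark index
    return marked

image = Image.new("RGB", (320, 240), "gray")  # placeholder image (assumption)
marked = draw_marks(image, [(20, 30, 120, 140), (160, 60, 300, 200)])
marked.save("som_prompt.png")  # send this image plus the text query to the LMM
```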
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language.
Dual Attention Networks for Visual Reference Resolution in Visual Dialog
Specifically, the REFER module learns latent relationships between a given question and the dialog history by employing a self-attention mechanism.
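As a toy illustration of question–history attention, the sketch below lets a pooled question embedding attend over dialog-history turn embeddings with PyTorch's MultiheadAttention; the cross-attention formulation, dimensions, and random inputs are simplifying assumptions for illustration, not the paper's exact REFER module.

```python
import torch
import torch.nn as nn

d_model = 256  # embedding size (assumption)
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

question = torch.randn(1, 1, d_model)   # one pooled question embedding
history = torch.randn(1, 10, d_model)   # ten dialog-history turn embeddings

# The question attends over the history; the weights indicate which past
# turns help resolve references (e.g. pronouns) in the current question.
context, weights = attn(question, history, history)
print(context.shape, weights.shape)  # torch.Size([1, 1, 256]) torch.Size([1, 1, 10])
```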
A Fast and Accurate One-Stage Approach to Visual Grounding
We propose a simple, fast, and accurate one-stage approach to visual grounding, inspired by the following insight.
Learning Cross-modal Context Graph for Visual Grounding
To address their limitations, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
Composing Pick-and-Place Tasks By Grounding Language
Controlling robots to perform tasks via natural language is one of the most challenging topics in human-robot interaction.
Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images
Grounding referring expressions in RGBD images is an emerging field.