Visual Grounding
173 papers with code • 3 benchmarks • 5 datasets
Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a minimal inference sketch follows this list):
- What is the main focus of the query?
- How to understand the image?
- How to locate the target object?
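As a quick illustration of how a query-to-region pipeline is typically driven, the sketch below runs an off-the-shelf open-vocabulary detector (OWL-ViT via Hugging Face `transformers`) on a free-form phrase. This is a minimal sketch under stated assumptions, not the method of any paper listed on this page; the checkpoint name, image path, query phrase, and score threshold are illustrative choices.

```python
# Minimal sketch: locate an image region from a text query with OWL-ViT,
# an open-vocabulary detector that can serve as a phrase-grounding baseline.
# Checkpoint, image path, query, and threshold below are illustrative choices.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("example.jpg")           # any RGB image
queries = [["a dog wearing a red collar"]]  # one list of phrases per image

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into (score, label, box) triples in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[0][int(label)]}: score={score:.2f}, box={box.tolist()}")
```

Models trained specifically for referring expressions (e.g., on RefCOCO-style data) follow the same pattern: encode the image and the query jointly, then return the box ranked most relevant to the query.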
Latest papers
MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis
Chest X-ray images are commonly used for predicting acute and chronic cardiopulmonary conditions, but efforts to integrate them with structured clinical data face challenges due to incomplete electronic health records (EHR).
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling
Today's most accurate language models are trained on orders of magnitude more language data than human language learners receive, but with no supervision from other sensory modalities that play a crucial role in human learning.
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages.
Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding
Large Vision-Language Models (LVLMs) are susceptible to object hallucinations, an issue in which their generated text contains non-existent objects, greatly limiting their reliability and practicality.
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions
Previous datasets and methods for the classic VG task mainly rely on the prior assumption that the given expression must literally refer to the target object, which greatly impedes the practical deployment of agents in real-world scenarios.
ChatterBox: Multi-round Multimodal Referring and Grounding
In this study, we establish a baseline for a new task named multimodal multi-round referring and grounding (MRG), opening up a promising direction for instance-level multimodal dialogues.
Unifying Visual and Vision-Language Tracking via Contrastive Learning
Single object tracking aims to locate the target object in a video sequence according to the state specified by different modal references, including the initial bounding box (BBOX), natural language (NL), or both (NL+BBOX).
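As a purely illustrative aside (not code from this paper), the three reference modes can be captured by a small container whose populated fields determine whether tracking is driven by BBOX, NL, or NL+BBOX; all names below are hypothetical.

```python
# Illustrative only: a container for the three reference modes described above.
# Field and class names are hypothetical, not taken from the paper.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TrackingReference:
    bbox: Optional[Tuple[float, float, float, float]] = None  # initial box (x, y, w, h)
    text: Optional[str] = None                                 # natural-language description

    def mode(self) -> str:
        if self.bbox is not None and self.text is not None:
            return "NL+BBOX"
        if self.bbox is not None:
            return "BBOX"
        if self.text is not None:
            return "NL"
        raise ValueError("A reference must provide a bounding box, a description, or both.")
```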
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model
Specifically, after projecting RS visual features to the language domain via an alignment layer, they are fed jointly with task-specific instructions into an LLM-based RS decoder to predict answers for RS open-ended tasks.
Veagle: Advancements in Multimodal Representation Learning
In response to the limitations observed in current Vision Language Models (VLMs) and Multimodal Large Language Models (MLLMs), our proposed model, Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works.
Uncovering the Full Potential of Visual Grounding Methods in VQA
In this study, we demonstrate that current evaluation schemes for VG methods are problematic due to the flawed assumption that relevant visual information is always available.