Visual Grounding

173 papers with code • 3 benchmarks • 5 datasets

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a minimal inference sketch follows the list):

  • How to identify the main focus of the query?
  • How to understand the image?
  • How to localize the referred object?
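
As a rough illustration of the task interface (not any particular paper's method), the sketch below scores detector-style region proposals against the query with an off-the-shelf CLIP model and returns the best-matching box. The checkpoint name, the `ground` helper, and the proposal format are assumptions made for illustration.

```python
# Minimal CLIP-based grounding sketch: score candidate boxes against the query
# and return the best-matching one. Region proposals are assumed to come from
# elsewhere (e.g. a detector) as plain (x1, y1, x2, y2) tuples.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def ground(image: Image.Image, query: str, boxes: list[tuple[int, int, int, int]]):
    """Return the proposal box whose crop best matches the text query."""
    crops = [image.crop(box) for box in boxes]
    inputs = processor(text=[query], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (num_crops, 1) similarity scores
    best = logits.squeeze(-1).argmax().item()
    return boxes[best]

# Example: pick the region best described by the phrase.
# img = Image.open("street.jpg")
# print(ground(img, "the person holding a red umbrella",
#              [(0, 0, 200, 300), (210, 40, 400, 300)]))
```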

MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis

biomedia-mbzuai/medpromptx 22 Mar 2024

Chest X-ray images are commonly used for predicting acute and chronic cardiopulmonary conditions, but efforts to integrate them with structured clinical data face challenges due to incomplete electronic health records (EHR).

Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling

EvLab-MIT/LexiContrastiveGrd 21 Mar 2024

Today's most accurate language models are trained on orders of magnitude more language data than human language learners receive - but with no supervision from other sensory modalities that play a crucial role in human learning.

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

qizekun/ReCon 27 Feb 2024

This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages.

Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding

d-ailin/clip-guided-decoding 23 Feb 2024

Large Vision-Language Models (LVLMs) are susceptible to object hallucinations, an issue in which their generated text contains non-existent objects, greatly limiting their reliability and practicality.
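
The repository implements its own decoding procedure; as a simplified, hedged illustration of the general idea behind CLIP guidance, the sketch below merely re-ranks several candidate captions by CLIP image-text similarity and keeps the most image-consistent one. The `pick_grounded_caption` helper and the hypothetical `lvlm_generate` call are illustrative assumptions, not the paper's API.

```python
# Generic sketch of CLIP-guided candidate selection (not the exact algorithm
# from clip-guided-decoding): generate several candidate captions with an LVLM,
# then keep the one CLIP judges most consistent with the image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pick_grounded_caption(image: Image.Image, candidates: list[str]) -> str:
    """Return the candidate caption with the highest CLIP image-text similarity."""
    inputs = clip_proc(text=candidates, images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = clip(**inputs).logits_per_image[0]  # one score per candidate
    return candidates[scores.argmax().item()]

# candidates = lvlm_generate(image, "Describe the image.", num_return_sequences=4)  # hypothetical helper
# caption = pick_grounded_caption(image, candidates)
```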

Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions

rubics-xuan/ivg 17 Feb 2024

Previous datasets and methods for the classic VG task mainly rely on the prior assumption that the given expression must literally refer to the target object, which greatly impedes the practical deployment of agents in real-world scenarios.

ChatterBox: Multi-round Multimodal Referring and Grounding

sunsmarterjie/chatterbox 24 Jan 2024

In this study, we establish a baseline for a new task named multimodal multi-round referring and grounding (MRG), opening up a promising direction for instance-level multimodal dialogues.

Unifying Visual and Vision-Language Tracking via Contrastive Learning

openspaceai/uvltrack 20 Jan 2024

Single object tracking aims to locate the target object in a video sequence according to the state specified by different modal references, including the initial bounding box (BBOX), natural language (NL), or both (NL+BBOX).
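
As a toy sketch (not UVLTrack's actual architecture), the snippet below shows one way the three reference modalities could be mapped into a single shared embedding space so that one tracker head can consume BBOX, NL, or NL+BBOX references; all module names and dimensions are assumptions.

```python
# Toy sketch of a unified reference encoder for BBOX, NL, or NL+BBOX references.
from typing import Optional
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    def __init__(self, vocab_size: int = 30522, dim: int = 256):
        super().__init__()
        self.box_proj = nn.Linear(4, dim)              # (x, y, w, h) -> shared space
        self.text_emb = nn.Embedding(vocab_size, dim)  # token ids -> shared space

    def forward(self, bbox: Optional[torch.Tensor] = None,
                text_ids: Optional[torch.Tensor] = None) -> torch.Tensor:
        parts = []
        if bbox is not None:
            parts.append(self.box_proj(bbox))                  # (B, dim)
        if text_ids is not None:
            parts.append(self.text_emb(text_ids).mean(dim=1))  # mean-pool tokens -> (B, dim)
        return torch.stack(parts, dim=0).mean(dim=0)           # fuse whichever references exist

enc = ReferenceEncoder()
ref = enc(bbox=torch.rand(1, 4), text_ids=torch.randint(0, 30522, (1, 8)))  # NL+BBOX
print(ref.shape)  # torch.Size([1, 256])
```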

SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model

zhanyang-nwpu/skyeyegpt 18 Jan 2024

Specifically, after projecting RS visual features to the language domain via an alignment layer, they are fed jointly with task-specific instructions into an LLM-based RS decoder to predict answers for RS open-ended tasks.
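
A hedged sketch of that projection-then-concatenate pattern is given below; the single linear alignment layer, the dimensions, and the `build_decoder_inputs` helper are assumptions for illustration rather than SkyEyeGPT's exact configuration.

```python
# Sketch: project visual features into the LLM embedding space via an alignment
# layer, then prepend them to the embedded task instruction before decoding.
import torch
import torch.nn as nn

VIS_DIM, LLM_DIM = 1024, 4096          # assumed visual-encoder / LLM hidden sizes
align = nn.Linear(VIS_DIM, LLM_DIM)    # alignment layer: visual -> language domain

def build_decoder_inputs(visual_feats: torch.Tensor,
                         instruction_embeds: torch.Tensor) -> torch.Tensor:
    """Project visual features and prepend them to the instruction embeddings."""
    vis_tokens = align(visual_feats)                            # (B, N_vis, LLM_DIM)
    return torch.cat([vis_tokens, instruction_embeds], dim=1)   # (B, N_vis + N_txt, LLM_DIM)

# Dummy tensors stand in for encoder outputs and embedded instructions:
inputs_embeds = build_decoder_inputs(torch.rand(1, 196, VIS_DIM), torch.rand(1, 32, LLM_DIM))
print(inputs_embeds.shape)  # torch.Size([1, 228, 4096])
# inputs_embeds would then be passed to the LLM decoder, e.g. llm(inputs_embeds=inputs_embeds).
```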

Veagle: Advancements in Multimodal Representation Learning

superagi/veagle 18 Jan 2024

In response to the limitations observed in current Vision Language Models (VLMs) and Multimodal Large Language Models (MLLMs), our proposed model, Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works.

Uncovering the Full Potential of Visual Grounding Methods in VQA

dreichcsl/truevg 15 Jan 2024

In this study, we demonstrate that current evaluation schemes for VG methods are problematic due to the flawed assumption that relevant visual information is available.
