Phrase Grounding

36 papers with code • 5 benchmarks • 6 datasets

Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.

Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
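
As a rough illustration of the task definition above, the sketch below shows one possible way to represent a phrase grounding example: a caption, the noun phrases located in it, and one image region per phrase. The class names (`GroundedPhrase`, `GroundingExample`), the helper `ground`, the box convention, and the sample caption and coordinates are all hypothetical and are not taken from any dataset or paper listed here.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedPhrase:
    start: int  # character offset where the noun phrase begins in the caption
    end: int    # character offset one past the last character of the phrase
    box: Box    # image region the phrase is grounded to


@dataclass
class GroundingExample:
    caption: str
    phrases: List[GroundedPhrase]

    def phrase_texts(self) -> List[str]:
        """Recover each noun phrase from its character span."""
        return [self.caption[p.start:p.end] for p in self.phrases]


def ground(caption: str, phrase: str, box: Box) -> GroundedPhrase:
    """Hypothetical helper: locate the first occurrence of `phrase` and attach a box."""
    start = caption.index(phrase)  # raises ValueError if the phrase is absent
    return GroundedPhrase(start, start + len(phrase), box)


# Made-up caption and coordinates, for illustration only.
caption = "A man throws a frisbee to a dog"
example = GroundingExample(
    caption=caption,
    phrases=[
        ground(caption, "A man", (34.0, 20.0, 180.0, 310.0)),
        ground(caption, "a frisbee", (210.0, 95.0, 260.0, 130.0)),
        ground(caption, "a dog", (300.0, 200.0, 420.0, 315.0)),
    ],
)

for text, phrase in zip(example.phrase_texts(), example.phrases):
    print(f"{text!r} -> {phrase.box}")
```

A phrase grounding model would take the image and caption as input and predict the `box` field for each phrase; here the boxes are filled in by hand.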


Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

jefferyzhan/griffon 14 Mar 2024

Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios.

70 stars

An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

open-mmlab/mmdetection 4 Jan 2024

Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC).

27,933 stars

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

mbzuai-oryx/video-llava 22 Nov 2023

Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data.

201 stars

Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models

amzn/augment-the-pairs-wacv2024 5 Nov 2023

While we demonstrate our data augmentation method with MDETR framework, the proposed approach is applicable to common grounding-based vision and language tasks with other frameworks.

2 stars

Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge

pluslabnlp/envision 23 Oct 2023

The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.

4 stars

Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks

eyalgomel/box-based-refinement ICCV 2023

It has been established that training a box-based detector network can enhance the localization performance of weakly supervised and unsupervised methods.

6 stars
07 Sep 2023

A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models

lil-lab/phrase_grounding 6 Sep 2023

Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions.

1 star

A Survey on Interpretable Cross-modal Reasoning

ZuyiZhou/Awesome-Interpretable-Cross-modal-Reasoning 5 Sep 2023

In recent years, cross-modal reasoning (CMR), the process of understanding and reasoning across different modalities, has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.

13 stars

Kosmos-2: Grounding Multimodal Large Language Models to the World

microsoft/unilm 26 Jun 2023

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.

18,448 stars

Trade-offs in Fine-tuned Diffusion Models Between Accuracy and Interpretability

mischad/chest-distillation 31 Mar 2023

Recent advancements in diffusion models have significantly impacted the trajectory of generative machine learning research, with many adopting the strategy of fine-tuning pre-trained models using domain-specific text-to-image datasets.

2 stars