Phrase Grounding
36 papers with code • 5 benchmarks • 6 datasets
Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.
Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
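As a rough illustration of what the task asks of a model, here is a minimal sketch of a simple two-stage baseline (not the method of any paper listed below): embed each noun phrase and each candidate image region with CLIP, then assign each phrase the highest-scoring region. The image path, phrase list, region proposals, and checkpoint are assumptions for illustration.

```python
# Minimal two-stage phrase-grounding sketch: score candidate regions
# against each noun phrase with CLIP and keep the best box per phrase.
# The image, phrases, proposals, and checkpoint are illustrative
# assumptions, not any listed paper's method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                    # assumed input image
phrases = ["a man in a red shirt", "a brown dog"]    # noun phrases from the caption
boxes = [(0, 0, 200, 300), (180, 150, 400, 320)]     # assumed region proposals (x0, y0, x1, y1)

crops = [image.crop(b) for b in boxes]
inputs = processor(text=phrases, images=crops, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# logits_per_text has shape (num_phrases, num_regions); each row scores
# one phrase against every candidate region.
best = out.logits_per_text.argmax(dim=1).tolist()
for phrase, idx in zip(phrases, best):
    print(f"{phrase!r} -> box {boxes[idx]}")
```

Dedicated grounding models, including several of the papers below, replace the fixed region proposals with end-to-end box prediction conditioned on the caption.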
Latest papers
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Large Vision Language Models have achieved fine-grained object perception, but limited image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios.
An Open and Comprehensive Pipeline for Unified Object Grounding and Detection
Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC).
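As a sketch of running phrase grounding with a Grounding DINO checkpoint, the example below uses the Hugging Face transformers port rather than the paper's MMDetection-based pipeline; the checkpoint name, image path, caption, and thresholds are assumptions for illustration.

```python
# Sketch of phrase grounding with a Grounding DINO checkpoint via the
# Hugging Face transformers port (the paper's own open pipeline is built
# on MMDetection). Checkpoint, inputs, and thresholds are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, GroundingDinoForObjectDetection

ckpt = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(ckpt)
model = GroundingDinoForObjectDetection.from_pretrained(ckpt)

image = Image.open("example.jpg")
# Grounding DINO expects lowercase phrases, each terminated by a period.
caption = "a man in a red shirt. a brown dog."

inputs = processor(images=image, text=caption, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],  # (height, width)
)[0]
for label, box, score in zip(results["labels"], results["boxes"], results["scores"]):
    print(label, [round(v, 1) for v in box.tolist()], round(score.item(), 3))
```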
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data.
Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models
While we demonstrate our data augmentation method with the MDETR framework, the proposed approach is applicable to common grounding-based vision and language tasks with other frameworks.
Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks
It has been established that training a box-based detector network can enhance the localization performance of weakly supervised and unsupervised methods.
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions.
A Survey on Interpretable Cross-modal Reasoning
In recent years, cross-modal reasoning (CMR), the process of understanding and reasoning across different modalities, has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.
Kosmos-2: Grounding Multimodal Large Language Models to the World
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.
Trade-offs in Fine-tuned Diffusion Models Between Accuracy and Interpretability
Recent advancements in diffusion models have significantly impacted the trajectory of generative machine learning research, with many adopting the strategy of fine-tuning pre-trained models using domain-specific text-to-image datasets.