Phrase Grounding
36 papers with code • 5 benchmarks • 6 datasets
Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.
Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
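At its core, the task maps each noun phrase in the caption to one image region. A minimal illustrative sketch (not any particular paper's method): assuming phrase and region embeddings already live in a shared space, each phrase is grounded to its highest-cosine-similarity candidate region.

```python
import numpy as np

def ground_phrases(phrase_embs, region_embs):
    """Match each noun-phrase embedding to its best-scoring image region.

    phrase_embs: (P, D) array, one row per noun phrase in the caption.
    region_embs: (R, D) array, one row per candidate region (e.g. detector boxes).
    Returns an array of length P holding the chosen region index per phrase.
    """
    # Normalize rows so the dot product becomes cosine similarity.
    p = phrase_embs / np.linalg.norm(phrase_embs, axis=1, keepdims=True)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    scores = p @ r.T  # (P, R) phrase-region similarity matrix
    return scores.argmax(axis=1)

# Toy example: 2 phrases, 3 candidate regions in a 4-d embedding space.
phrases = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])
regions = np.array([[0.0, 0.9, 0.1, 0.0],   # best match for phrase 1
                    [0.9, 0.0, 0.0, 0.1],   # best match for phrase 0
                    [0.0, 0.0, 1.0, 0.0]])  # distractor region
print(ground_phrases(phrases, regions))  # -> [1 0]
```

Real systems differ in how those embeddings are produced (detector-based vs. detector-free, supervised vs. weakly supervised), which is exactly what the papers below explore.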
Most implemented papers
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
We show that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves the performance on position-insensitive tasks with grounded inputs.
GLIPv2: Unifying Localization and Vision-Language Understanding
We present GLIPv2, a grounded VL understanding model, that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning).
What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs
Moreover, training takes place in a weakly supervised setting, where no bounding boxes are provided.
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network
The advancement of object detection (OD) in open-vocabulary and open-world scenarios is a critical challenge in computer vision.
Extending Phrase Grounding with Pronouns in Visual Dialogues
First, we construct a dataset for phrase grounding that links both noun phrases and pronouns to image regions.
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding
As phrase extraction can be regarded as a 1D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction.
Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding
A phrase grounding model receives an input image and a text phrase and outputs a suitable localization map.
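The localization map mentioned here is a per-pixel (or per-patch) relevance score for the phrase. As a hedged sketch, assuming per-patch visual features and a phrase embedding in a shared space, cosine similarity per patch yields such a map:

```python
import numpy as np

def localization_map(phrase_emb, patch_feats):
    """Produce a heatmap over image patches for one text phrase.

    phrase_emb:  (D,) embedding of the query phrase.
    patch_feats: (H, W, D) per-patch visual features.
    Returns an (H, W) map of cosine similarities in [-1, 1].
    """
    q = phrase_emb / np.linalg.norm(phrase_emb)
    f = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    return f @ q  # (H, W) similarity heatmap

# Toy 2x2 feature grid in a 3-d space; the phrase matches the top-left patch.
feats = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                  [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]])
heat = localization_map(np.array([1.0, 0.0, 0.0]), feats)
print(np.unravel_index(heat.argmax(), heat.shape))  # -> (0, 0)
```

In the weakly supervised self-training setting of the paper above, such maps serve as pseudo-labels rather than being supervised by ground-truth boxes.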
Trade-offs in Fine-tuned Diffusion Models Between Accuracy and Interpretability
Recent advancements in diffusion models have significantly impacted the trajectory of generative machine learning research, with many adopting the strategy of fine-tuning pre-trained models using domain-specific text-to-image datasets.
A Survey on Interpretable Cross-modal Reasoning
In recent years, cross-modal reasoning (CMR), the process of understanding and reasoning across different modalities, has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions.