Phrase Grounding

36 papers with code • 5 benchmarks • 6 datasets

Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.

Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
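
As a rough illustration of the task definition above, the output of phrase grounding can be thought of as a set of links from caption spans to image boxes. The sketch below is only one possible representation: the names `Box`, `GroundedPhrase`, and `ground`, the pixel-coordinate box format, and the example caption and coordinates are all hypothetical and are not taken from any paper listed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned image region (assumed format: pixel corners)."""
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass(frozen=True)
class GroundedPhrase:
    """A noun phrase, given as a character span in the caption, tied to a region."""
    start: int  # inclusive character offset in the caption
    end: int    # exclusive character offset in the caption
    box: Box


def ground(caption: str, phrase: str, box: Box) -> GroundedPhrase:
    """Link the first occurrence of `phrase` in `caption` to `box`."""
    start = caption.find(phrase)
    if start < 0:
        raise ValueError(f"phrase {phrase!r} not found in caption")
    return GroundedPhrase(start, start + len(phrase), box)


def phrase_text(caption: str, grounded: GroundedPhrase) -> str:
    """Recover the phrase string from its character span."""
    return caption[grounded.start:grounded.end]


# Made-up caption and box coordinates, for illustration only.
caption = "A man throws a frisbee to a dog"
groundings = [
    ground(caption, "A man", Box(12, 30, 140, 310)),
    ground(caption, "a frisbee", Box(150, 80, 190, 110)),
    ground(caption, "a dog", Box(220, 200, 330, 300)),
]
```

Storing character offsets rather than the phrase strings keeps each link unambiguous when the same words occur more than once in a caption.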

Most implemented papers

PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models

thunlp/pevl 23 May 2022

We show that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves the performance on position-insensitive tasks with grounded inputs.

GLIPv2: Unifying Localization and Vision-Language Understanding

microsoft/GLIP 12 Jun 2022

We present GLIPv2, a grounded VL understanding model, that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning).

What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs

talshaharabany/what-is-where-by-looking 19 Jun 2022

Training takes place in a weakly supervised setting, where no bounding boxes are provided.

OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network

om-ai-lab/OmDet 10 Sep 2022

The advancement of object detection (OD) in open-vocabulary and open-world scenarios is a critical challenge in computer vision.

Extending Phrase Grounding with Pronouns in Visual Dialogues

izhx/phrase-grounding-with-pronoun 23 Oct 2022

First, we construct a dataset for phrase grounding that links both noun phrases and pronouns to image regions.

DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

idea-research/dq-detr 28 Nov 2022

As phrase extraction can be regarded as a 1D text segmentation problem, we formulate phrase extraction and grounding (PEG) as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction.
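
To make the "1D text segmentation" view concrete, one can picture phrase extraction as predicting a binary mask over the caption tokens, where each contiguous run of ones marks a phrase. The sketch below only illustrates that idea. It is not DQ-DETR's implementation, and the function name `mask_to_phrases`, the whitespace tokenization, and the example mask are assumptions made for illustration.

```python
from typing import List, Sequence


def mask_to_phrases(tokens: Sequence[str], mask: Sequence[int]) -> List[str]:
    """Turn a per-token binary mask into phrases (contiguous runs of 1s)."""
    if len(tokens) != len(mask):
        raise ValueError("tokens and mask must have the same length")
    phrases: List[str] = []
    current: List[str] = []
    for token, flag in zip(tokens, mask):
        if flag:
            current.append(token)  # still inside a phrase
        elif current:
            phrases.append(" ".join(current))  # a run of 1s just ended
            current = []
    if current:  # close a phrase that runs to the end of the caption
        phrases.append(" ".join(current))
    return phrases


# Hypothetical tokens and mask, for illustration only.
tokens = "a man throws a frisbee".split()
mask = [1, 1, 0, 1, 1]
phrases = mask_to_phrases(tokens, mask)
```

Note that a single binary mask cannot separate two phrases that are directly adjacent in the text. A per-phrase mask avoids this, which is one reason to treat each phrase as its own prediction.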

Trade-offs in Fine-tuned Diffusion Models Between Accuracy and Interpretability

mischad/chest-distillation 31 Mar 2023

Recent advancements in diffusion models have significantly impacted the trajectory of generative machine learning research, with many adopting the strategy of fine-tuning pre-trained models using domain-specific text-to-image datasets.

A Survey on Interpretable Cross-modal Reasoning

ZuyiZhou/Awesome-Interpretable-Cross-modal-Reasoning 5 Sep 2023

In recent years, cross-modal reasoning (CMR), the process of understanding and reasoning across different modalities, has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.

A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models

lil-lab/phrase_grounding 6 Sep 2023

Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions.