Phrase Grounding
36 papers with code • 5 benchmarks • 6 datasets
Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.
Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
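At its core, the task maps each noun phrase in the caption to one image region. A minimal illustrative sketch (not any particular paper's method): assuming phrase and region embeddings already live in a shared space, each phrase is grounded to its highest-cosine-similarity candidate region.

```python
import numpy as np

def ground_phrases(phrase_embs, region_embs):
    """Match each noun-phrase embedding to its best-scoring image region.

    phrase_embs: (P, D) array, one row per noun phrase in the caption.
    region_embs: (R, D) array, one row per candidate region (e.g. detector boxes).
    Returns an array of length P holding the chosen region index per phrase.
    """
    # Normalize rows so the dot product becomes cosine similarity.
    p = phrase_embs / np.linalg.norm(phrase_embs, axis=1, keepdims=True)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    scores = p @ r.T  # (P, R) phrase-region similarity matrix
    return scores.argmax(axis=1)

# Toy example: 2 phrases, 3 candidate regions in a 4-d embedding space.
phrases = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])
regions = np.array([[0.0, 0.9, 0.1, 0.0],   # best match for phrase 1
                    [0.9, 0.0, 0.0, 0.1],   # best match for phrase 0
                    [0.0, 0.0, 1.0, 0.0]])  # distractor region
print(ground_phrases(phrases, regions))  # -> [1 0]
```

Real systems differ in how those embeddings are produced (detector-based vs. detector-free, supervised vs. weakly supervised), which is exactly what the papers below explore.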
Most implemented papers
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
We show that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves the performance on position-insensitive tasks with grounded inputs.
GLIPv2: Unifying Localization and Vision-Language Understanding
We present GLIPv2, a grounded VL understanding model, that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning).
What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs
Moreover, training takes place in a weakly supervised setting, where no bounding boxes are provided.
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network
The advancement of object detection (OD) in open-vocabulary and open-world scenarios is a critical challenge in computer vision.
Extending Phrase Grounding with Pronouns in Visual Dialogues
First, we construct a dataset for phrase grounding that links both noun phrases and pronouns to image regions.
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding
As phrase extraction can be regarded as a 1D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction.
Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding
A phrase grounding model receives an input image and a text phrase and outputs a suitable localization map.
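The localization map mentioned here is a per-pixel (or per-patch) relevance score for the phrase. As a hedged sketch, assuming per-patch visual features and a phrase embedding in a shared space, cosine similarity per patch yields such a map:

```python
import numpy as np

def localization_map(phrase_emb, patch_feats):
    """Produce a heatmap over image patches for one text phrase.

    phrase_emb:  (D,) embedding of the query phrase.
    patch_feats: (H, W, D) per-patch visual features.
    Returns an (H, W) map of cosine similarities in [-1, 1].
    """
    q = phrase_emb / np.linalg.norm(phrase_emb)
    f = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    return f @ q  # (H, W) similarity heatmap

# Toy 2x2 feature grid in a 3-d space; the phrase matches the top-left patch.
feats = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                  [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]])
heat = localization_map(np.array([1.0, 0.0, 0.0]), feats)
print(np.unravel_index(heat.argmax(), heat.shape))  # -> (0, 0)
```

In the weakly supervised self-training setting of the paper above, such maps serve as pseudo-labels rather than being supervised by ground-truth boxes.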
Trade-offs in Fine-tuned Diffusion Models Between Accuracy and Interpretability
Recent advancements in diffusion models have significantly impacted the trajectory of generative machine learning research, with many adopting the strategy of fine-tuning pre-trained models using domain-specific text-to-image datasets.
A Survey on Interpretable Cross-modal Reasoning
In recent years, cross-modal reasoning (CMR), the process of understanding and reasoning across different modalities, has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions.