Phrase Grounding
36 papers with code • 5 benchmarks • 6 datasets
Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.
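At its core, phrase grounding reduces to matching each noun-phrase representation against a set of candidate image regions. Below is a minimal, hypothetical sketch (the function name, embedding shapes, and cosine-similarity choice are illustrative assumptions, not any specific paper's method):

```python
import numpy as np

def ground_phrases(phrase_embs, region_embs):
    """Assign each noun-phrase embedding to its best-matching image region.

    phrase_embs: (P, D) array, one row per noun phrase in the caption.
    region_embs: (R, D) array, one row per candidate region (e.g. detector boxes).
    Returns a length-P array of region indices. Toy interface for illustration.
    """
    # Cosine similarity between every phrase and every region.
    p = phrase_embs / np.linalg.norm(phrase_embs, axis=1, keepdims=True)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    sim = p @ r.T                 # (P, R) similarity matrix
    return sim.argmax(axis=1)     # index of the best region per phrase
```

Real systems differ mainly in how the two embedding spaces are learned (supervised boxes, weak supervision, contrastive objectives); the argmax-over-regions matching step is the common core.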
Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
Most implemented papers
Learning Cross-modal Context Graph for Visual Grounding
To address these limitations, this paper proposes a language-guided graph representation that captures the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
Contrastive Learning for Weakly Supervised Phrase Grounding
Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions.
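The idea described above can be sketched as an InfoNCE-style objective: each word attends over image regions, the attention-weighted region summary scores the word-image compatibility, and matched caption-image pairs must outscore mismatched pairs within a batch. This is a toy sketch under assumed shapes, not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def caption_image_score(word_embs, region_embs):
    """Attention-weighted compatibility between one caption and one image.

    word_embs: (W, D), region_embs: (R, D). Each word attends over regions;
    the score is the mean similarity between words and their attended regions.
    """
    att = softmax(word_embs @ region_embs.T, axis=1)  # (W, R) word-to-region attention
    attended = att @ region_embs                      # (W, D) per-word region summary
    return float(np.mean(np.sum(word_embs * attended, axis=1)))

def contrastive_loss(captions, images):
    """InfoNCE-style loss: matched (caption_i, image_i) pairs should outscore
    mismatched pairs (caption_i, image_j), j != i, within the batch."""
    scores = np.array([[caption_image_score(c, v) for v in images]
                       for c in captions])           # (N, N) score matrix
    log_probs = np.log(softmax(scores, axis=1).diagonal())
    return float(-log_probs.mean())
```

Because the objective only needs paired images and captions, no box-level annotation is required; grounding emerges from the attention maps as a by-product.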
Neural Parameter Allocation Search
We introduce Neural Parameter Allocation Search (NPAS), a novel task where the goal is to train a neural network given an arbitrary, fixed parameter budget.
Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation
Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed.
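One natural way to lift a region-phrase score into an image-sentence score is to let each phrase take its best-matching region (max over regions) and then aggregate over phrases. The sketch below illustrates that construction with a dot-product scorer; the specific pooling choices are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def region_phrase_scores(phrase_embs, region_embs):
    """Score every (phrase, region) pair; here a simple dot product."""
    return phrase_embs @ region_embs.T  # (P, R)

def image_sentence_score(phrase_embs, region_embs):
    """Each phrase keeps its best region (max over R), then the per-phrase
    scores are averaged to give one image-sentence compatibility score."""
    s = region_phrase_scores(phrase_embs, region_embs)
    return float(s.max(axis=1).mean())
```

The appeal of this construction is that only image-sentence supervision is needed to train it, yet the intermediate region-phrase scores directly yield groundings at test time.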
MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding
Phrase localization is a task that studies the mapping from textual phrases to regions of an image.
Learning to ground medical text in a 3D human atlas
In this paper, we develop a method for grounding medical text into a physically meaningful and interpretable space corresponding to a human atlas.
MDETR - Modulated Detection for End-to-End Multi-Modal Understanding
We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.
Detector-Free Weakly Supervised Grounding by Separation
In this work, we focus on the task of Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector.
Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships
Our goal is to bridge the visual scene graphs and linguistic dependency trees seamlessly.
Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing
We release a new dataset with locally-aligned phrase grounding annotations by radiologists to facilitate the study of complex semantic modelling in biomedical vision-language processing.