Phrase Grounding
36 papers with code • 5 benchmarks • 6 datasets
Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.
Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
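To make the task concrete: phrase grounding is conventionally scored by whether the predicted region for each phrase overlaps the annotated region with intersection-over-union (IoU) of at least 0.5. A minimal sketch of that metric (box format and function names are illustrative, not taken from any paper below):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def grounding_accuracy(predicted, gold, threshold=0.5):
    """Fraction of phrases whose predicted box matches the annotated box
    with IoU >= threshold (the common phrase-grounding accuracy metric)."""
    hits = sum(iou(predicted[p], gold[p]) >= threshold for p in predicted)
    return hits / len(predicted)
```

Both arguments to `grounding_accuracy` map each noun phrase to a single box; benchmarks such as Flickr30k Entities report this accuracy (often called Recall@1) over all annotated phrases.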
Latest papers with no code
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Visual and linguistic pre-training aims to learn vision and language representations together, which can be transferred to visual-linguistic downstream tasks.
LIMITR: Leveraging Local Information for Medical Image-Text Representation
Furthermore, the model integrates domain-specific information of two types: lateral images and the consistent visual structure of chest images.
Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection
Methods are mostly evaluated in terms of how well object class names are learned, but captions also contain rich attribute context that should be considered when learning object alignment.
Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment
To enable MedRPG to locate nuanced medical findings with better region-phrase correspondences, we further propose Tri-attention Context contrastive alignment (TaCo).
Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing
Prior work in biomedical VLP has mostly relied on the alignment of single image and report pairs even though clinical notes commonly refer to prior images.
Detailed Annotations of Chest X-Rays via CT Projection for Report Understanding
To exploit anatomical structures in this scenario, we present a sophisticated automatic pipeline to gather and integrate human bodily structures from computed tomography datasets, which we incorporate into PAXRay: a Projected dataset for the segmentation of Anatomical structures in X-Ray data.
Lite-MDETR: A Lightweight Multi-Modal Detector
The key primitive is Dictionary-Lookup-Transformation (DLT), proposed to replace Linear Transformation (LT) in multi-modal detectors: each weight of a linear transformation is approximately factorized into a smaller dictionary, an index, and a coefficient.
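The general idea of such a dictionary-lookup factorization can be sketched as follows. This is a rough NumPy illustration of the concept described in the excerpt, not the paper's exact formulation; shapes, dictionary size, and the per-row quantization scheme are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))                # weights of a linear transform

# Compact parts: a small shared dictionary, one scale coefficient per row,
# and an integer index per weight selecting the nearest dictionary entry.
dictionary = np.linspace(-1.0, 1.0, 16)
coeff = np.abs(W).max(axis=1, keepdims=True)   # per-row scale coefficient
normalized = W / coeff                          # bring weights into [-1, 1]
index = np.abs(normalized[..., None] - dictionary).argmin(axis=-1)

# At inference time, the dense weights are approximately reconstructed
# (or the lookup is fused into the matmul) from the compact parts.
W_approx = coeff * dictionary[index]
```

Storing small integer indices plus a shared dictionary instead of full-precision weights is what makes this kind of factorization attractive for lightweight detectors.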
Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling
We introduce a new task, unsupervised vision-language (VL) grammar induction.
Disentangled Motif-aware Graph Learning for Phrase Grounding
In this paper, we propose a novel graph learning framework for phrase grounding in the image.
Utilizing Every Image Object for Semi-supervised Phrase Grounding
The annotated language queries available during training are limited, which also limits the variations of language combinations that a model can see during training.