Phrase Grounding

36 papers with code • 5 benchmarks • 6 datasets

Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.

Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
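
As an illustration of the task's input and output, here is a minimal sketch. The `Box` type, the example caption, and all coordinates are hypothetical and not taken from any dataset. Intersection-over-union (IoU) is used only as one common way to compare a predicted region with a reference region; it is not stated in the definition above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned image region in pixel coordinates (hypothetical format)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Degenerate or inverted boxes count as empty.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    inter = Box(max(a.x1, b.x1), max(a.y1, b.y1),
                min(a.x2, b.x2), min(a.y2, b.y2))
    union = a.area() + b.area() - inter.area()
    return inter.area() / union if union > 0 else 0.0


# Input: an image (omitted here) and its caption.
caption = "A man in a red jacket throws a frisbee"

# Output: one region per noun phrase in the caption (made-up coordinates).
predicted = {
    "A man": Box(40, 30, 200, 380),
    "a red jacket": Box(60, 90, 180, 220),
    "a frisbee": Box(300, 120, 360, 160),
}

# Reference regions, also made up, to compare against.
reference = {
    "A man": Box(45, 35, 205, 375),
    "a red jacket": Box(60, 90, 180, 220),
    "a frisbee": Box(400, 300, 440, 340),
}

scores = {phrase: iou(predicted[phrase], reference[phrase])
          for phrase in predicted}
```

Here `scores` maps each noun phrase to the overlap between its predicted and reference regions, so a phrase grounded to the wrong part of the image (the frisbee in this toy data) scores zero.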

Latest papers with no code

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

no code yet • 10 Apr 2023

Visual and linguistic pre-training aims to learn vision and language representations together, which can be transferred to visual-linguistic downstream tasks.

LIMITR: Leveraging Local Information for Medical Image-Text Representation

no code yet • ICCV 2023

Furthermore, the model integrates domain-specific information of two types -- lateral images and the consistent visual structure of chest images.

Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection

no code yet • 17 Mar 2023

Methods are mostly evaluated in terms of how well object class names are learned, but captions also contain rich attribute context that should be considered when learning object alignment.

Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment

no code yet • 14 Mar 2023

To enable MedRPG to locate nuanced medical findings with better region-phrase correspondences, we further propose Tri-attention Context contrastive alignment (TaCo).

Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing

no code yet • CVPR 2023

Prior work in biomedical VLP has mostly relied on the alignment of single image and report pairs even though clinical notes commonly refer to prior images.

Detailed Annotations of Chest X-Rays via CT Projection for Report Understanding

no code yet • 7 Oct 2022

To exploit anatomical structures in this scenario, we present a sophisticated automatic pipeline to gather and integrate human bodily structures from computed tomography datasets, which we incorporate in our PAXRay: A Projected dataset for the segmentation of Anatomical structures in X-Ray data.

Lite-MDETR: A Lightweight Multi-Modal Detector

no code yet • CVPR 2022

The key primitive is Dictionary-Lookup-Transformations (DLT), proposed to replace Linear Transformation (LT) in multi-modal detectors, where each LT weight is approximately factorized into a smaller dictionary, index, and coefficient.
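
The factorization idea can be sketched in a few lines. This is one plausible reading of the sentence above, not the paper's implementation: the function names, the row-wise layout, and the toy numbers are all assumptions made for illustration. Each row of the weight matrix is rebuilt as a weighted sum of a few shared dictionary atoms, selected by index and scaled by coefficient.

```python
def reconstruct_weight(dictionary, indices, coeffs):
    """Rebuild a dense weight matrix, one row at a time.

    dictionary: list of atoms, each a list of d floats (shared by all rows)
    indices[i]: ids of the atoms used by row i
    coeffs[i]:  scale factor for each of those atoms
    """
    d = len(dictionary[0])
    weight = []
    for idx_row, coef_row in zip(indices, coeffs):
        row = [0.0] * d
        for atom_id, c in zip(idx_row, coef_row):
            atom = dictionary[atom_id]
            for k in range(d):
                row[k] += c * atom[k]
        weight.append(row)
    return weight


def apply_linear(weight, x):
    """Ordinary linear transformation y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def apply_lookup(dictionary, indices, coeffs, x):
    """Same output without building W: project x onto every atom once,
    then look up and combine the projections each row needs."""
    atom_dots = [sum(a * v for a, v in zip(atom, x)) for atom in dictionary]
    return [sum(c * atom_dots[i] for i, c in zip(idx_row, coef_row))
            for idx_row, coef_row in zip(indices, coeffs)]


# Toy example: 3 atoms of dimension 2, and a 3x2 weight using 2 atoms per row.
dictionary = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indices = [[0, 2], [1, 2], [0, 1]]
coeffs = [[2.0, 1.0], [1.0, -1.0], [0.5, 0.5]]
x = [3.0, 4.0]

dense = reconstruct_weight(dictionary, indices, coeffs)
y_dense = apply_linear(dense, x)
y_lookup = apply_lookup(dictionary, indices, coeffs, x)
```

In this sketch `y_dense` and `y_lookup` agree, and the lookup path stores only the dictionary, the indices, and the coefficients. Any saving in practice depends on the dictionary being much smaller than the weight matrices it replaces, which this toy example does not demonstrate.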

Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling

no code yet • ICLR 2022

We introduce a new task, unsupervised vision-language (VL) grammar induction.

Disentangled Motif-aware Graph Learning for Phrase Grounding

no code yet • 13 Apr 2021

In this paper, we propose a novel graph learning framework for phrase grounding in images.

Utilizing Every Image Object for Semi-supervised Phrase Grounding

no code yet • 5 Nov 2020

The annotated language queries available during training are limited, which also limits the variations of language combinations that a model can see during training.