Phrase Grounding
36 papers with code • 5 benchmarks • 6 datasets
Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.
Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
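To make the task concrete: phrase grounding is conventionally scored by whether the predicted region for each phrase overlaps the annotated region with intersection-over-union (IoU) of at least 0.5. A minimal sketch of that metric (box format and function names are illustrative, not taken from any paper below):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def grounding_accuracy(predicted, gold, threshold=0.5):
    """Fraction of phrases whose predicted box matches the annotated box
    with IoU >= threshold (the common phrase-grounding accuracy metric)."""
    hits = sum(iou(predicted[p], gold[p]) >= threshold for p in predicted)
    return hits / len(predicted)
```

Both arguments to `grounding_accuracy` map each noun phrase to a single box; benchmarks such as Flickr30k Entities report this accuracy (often called Recall@1) over all annotated phrases.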
Latest papers with no code
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Visual and linguistic pre-training aims to learn vision and language representations together, which can be transferred to visual-linguistic downstream tasks.
LIMITR: Leveraging Local Information for Medical Image-Text Representation
Furthermore, the model integrates domain-specific information of two types: lateral images and the consistent visual structure of chest images.
Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection
Methods are mostly evaluated in terms of how well object class names are learned, but captions also contain rich attribute context that should be considered when learning object alignment.
Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment
To enable MedRPG to locate nuanced medical findings with better region-phrase correspondences, we further propose Tri-attention Context contrastive alignment (TaCo).
Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing
Prior work in biomedical VLP has mostly relied on the alignment of single image and report pairs even though clinical notes commonly refer to prior images.
Detailed Annotations of Chest X-Rays via CT Projection for Report Understanding
To exploit anatomical structures in this scenario, we present a sophisticated automatic pipeline to gather and integrate human bodily structures from computed tomography datasets, which we incorporate into PAXRay: a Projected dataset for the segmentation of Anatomical structures in X-Ray data.
Lite-MDETR: A Lightweight Multi-Modal Detector
The key primitive is Dictionary-Lookup-Transformation (DLT), proposed to replace Linear Transformation (LT) in multi-modal detectors: each weight of a linear transformation is approximately factorized into a smaller dictionary, an index, and a coefficient.
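The general idea of such a dictionary-lookup factorization can be sketched as follows. This is a rough NumPy illustration of the concept described in the excerpt, not the paper's exact formulation; shapes, dictionary size, and the per-row quantization scheme are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))                # weights of a linear transform

# Compact parts: a small shared dictionary, one scale coefficient per row,
# and an integer index per weight selecting the nearest dictionary entry.
dictionary = np.linspace(-1.0, 1.0, 16)
coeff = np.abs(W).max(axis=1, keepdims=True)   # per-row scale coefficient
normalized = W / coeff                          # bring weights into [-1, 1]
index = np.abs(normalized[..., None] - dictionary).argmin(axis=-1)

# At inference time, the dense weights are approximately reconstructed
# (or the lookup is fused into the matmul) from the compact parts.
W_approx = coeff * dictionary[index]
```

Storing small integer indices plus a shared dictionary instead of full-precision weights is what makes this kind of factorization attractive for lightweight detectors.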
Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling
We introduce a new task, unsupervised vision-language (VL) grammar induction.
Disentangled Motif-aware Graph Learning for Phrase Grounding
In this paper, we propose a novel graph learning framework for phrase grounding in the image.
Utilizing Every Image Object for Semi-supervised Phrase Grounding
The annotated language queries available during training are limited, which also limits the variations of language combinations that a model can see during training.