Phrase Grounding

36 papers with code • 5 benchmarks • 6 datasets

Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.

Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
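
As a rough illustration of the task's inputs and outputs, the sketch below represents a region as an axis-aligned bounding box and grounds each noun phrase to the highest-scoring candidate region. The names `GroundingExample` and `ground_phrases`, the box format, the candidate-region list, and the score matrix are all assumptions made for illustration; they do not come from the cited paper, and real systems differ in how they score phrase-region pairs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed region format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundingExample:
    """Hypothetical container for one image-caption pair."""
    caption: str
    phrases: List[str]          # noun phrases taken from the caption
    candidate_boxes: List[Box]  # candidate regions in the image


def ground_phrases(example: GroundingExample,
                   scores: List[List[float]]) -> Dict[str, Box]:
    """Map each phrase to its best-scoring candidate region.

    scores[i][j] is an assumed compatibility score between phrase i
    and candidate box j; how it is computed is left to the model.
    """
    grounding: Dict[str, Box] = {}
    for phrase, row in zip(example.phrases, scores):
        best = max(range(len(row)), key=row.__getitem__)
        grounding[phrase] = example.candidate_boxes[best]
    return grounding


example = GroundingExample(
    caption="A dog chases a ball",
    phrases=["A dog", "a ball"],
    candidate_boxes=[(0, 0, 10, 10), (20, 20, 40, 40), (5, 5, 8, 8)],
)
# Made-up scores, one row per phrase and one column per candidate box.
print(ground_phrases(example, [[0.1, 0.9, 0.3], [0.2, 0.1, 0.7]]))
```

Picking the single best region per phrase is the simplest possible decision rule; it ignores phrases that refer to several regions or to none.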

Latest papers with no code

Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models

no code yet • 19 Apr 2024

In this work, we use a publicly available Foundation Model, namely the Latent Diffusion Model, to solve the challenging task of zero-shot medical phrase grounding.

MedRG: Medical Report Grounding with Multi-modal Large Language Model

no code yet • 10 Apr 2024

Medical Report Grounding is pivotal in identifying the most relevant regions in medical images based on a given phrase query, a critical aspect in medical image analysis and radiological diagnosis.

Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

no code yet • 4 Mar 2024

Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest.

How to Understand "Support"? An Implicit-enhanced Causal Inference Approach for Weakly-supervised Phrase Grounding

no code yet • 29 Feb 2024

Weakly-supervised Phrase Grounding (WPG) is an emerging task that infers fine-grained phrase-region matching while using only coarse-grained sentence-image pairs for training.

Phrase Grounding-based Style Transfer for Single-Domain Generalized Object Detection

no code yet • 2 Feb 2024

Single-domain generalized object detection aims to enhance a model's generalizability to multiple unseen target domains using only data from a single source domain during training.

Enhancing the vision-language foundation model with key semantic knowledge-emphasized report refinement

no code yet • 21 Jan 2024

Particularly, raw radiology reports are refined to highlight the key information according to a constructed clinical dictionary and two model-optimized knowledge-enhancement metrics.

Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications

no code yet • 30 Aug 2023

We present Catalog Phrase Grounding (CPG), a model that can associate product textual data (title, brands) with the corresponding regions of product images (isolated product region, brand logo region) for e-commerce vision-language applications.

Read, look and detect: Bounding box annotation from image-caption pairs

no code yet • 9 Jun 2023

Various methods have been proposed to detect objects while reducing the cost of data annotation.

ELVIS: Empowering Locality of Vision Language Pre-training with Intra-modal Similarity

no code yet • 11 Apr 2023

Deep learning has shown great potential in assisting radiologists in reading chest X-ray (CXR) images, but its need for expensive annotations for improving performance prevents widespread clinical application.

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

no code yet • 10 Apr 2023

Visual and linguistic pre-training aims to learn vision and language representations together, which can be transferred to visual-linguistic downstream tasks.