Phrase Grounding
36 papers with code • 5 benchmarks • 6 datasets
Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.
Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
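A common way to frame the task is as a matching problem between phrase embeddings and candidate region features: each noun phrase in the caption is assigned the image region whose visual embedding it matches best. The sketch below illustrates this framing with random tensors standing in for real encoders; the similarity-based matcher and the embedding dimensions are illustrative assumptions, not a specific published model.

```python
import torch
import torch.nn.functional as F

def ground_phrases(phrase_emb: torch.Tensor, region_emb: torch.Tensor):
    """Assign each phrase to its best-matching region via cosine similarity.

    phrase_emb: (num_phrases, dim) embeddings of noun phrases from the caption.
    region_emb: (num_regions, dim) embeddings of candidate image regions
                (e.g., from a detector or patch encoder -- assumed given here).
    Returns the chosen region index and the similarity score for each phrase.
    """
    phrase_emb = F.normalize(phrase_emb, dim=-1)
    region_emb = F.normalize(region_emb, dim=-1)
    sim = phrase_emb @ region_emb.T            # (num_phrases, num_regions)
    scores, best_region = sim.max(dim=-1)      # best region per phrase
    return best_region, scores

# Toy example: random features stand in for the outputs of real encoders.
torch.manual_seed(0)
phrases = ["a man", "a red frisbee"]           # noun phrases parsed from the caption
phrase_emb = torch.randn(len(phrases), 256)    # hypothetical text-encoder outputs
region_emb = torch.randn(10, 256)              # hypothetical features for 10 region proposals

best_region, scores = ground_phrases(phrase_emb, region_emb)
for p, r, s in zip(phrases, best_region.tolist(), scores.tolist()):
    print(f"{p!r} -> region {r} (similarity {s:.2f})")
```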
Latest papers with no code
Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models
In this work, we use a publicly available Foundation Model, namely the Latent Diffusion Model, to solve this challenging task.
MedRG: Medical Report Grounding with Multi-modal Large Language Model
Medical Report Grounding is pivotal in identifying the most relevant regions in medical images based on a given phrase query, a critical aspect in medical image analysis and radiological diagnosis.
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest.
How to Understand "Support"? An Implicit-enhanced Causal Inference Approach for Weakly-supervised Phrase Grounding
Weakly-supervised Phrase Grounding (WPG) is an emerging task of inferring fine-grained phrase-region matching while training only on coarse-grained sentence-image pairs; a common formulation of this setting is sketched below.
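Because only sentence-image pairs are labeled, a typical weakly-supervised formulation scores each pair by soft-attending every phrase over the image regions and trains that pair-level score with a ranking loss against mismatched images; the per-phrase attention weights then serve as the grounding at test time. The sketch below shows one such formulation under assumed encoders and a hinge margin; it is a generic illustration from the weakly-supervised grounding literature, not necessarily this paper's method.

```python
import torch
import torch.nn.functional as F

def sentence_image_score(phrase_emb, region_emb):
    """Score a (sentence, image) pair by soft-attending each phrase over regions.

    Only this pair-level score is supervised in the weakly-supervised setting;
    the per-phrase attention weights act as the phrase-region grounding.
    """
    p = F.normalize(phrase_emb, dim=-1)        # (P, D) phrase embeddings
    r = F.normalize(region_emb, dim=-1)        # (R, D) region embeddings
    sim = p @ r.T                              # (P, R) phrase-region similarities
    attn = sim.softmax(dim=-1)                 # soft phrase-to-region alignment
    return (attn * sim).sum(dim=-1).mean(), attn

def weakly_supervised_loss(phrase_emb, pos_regions, neg_regions, margin=0.2):
    """Hinge loss: the matched sentence-image pair should outscore a mismatched one."""
    pos_score, _ = sentence_image_score(phrase_emb, pos_regions)
    neg_score, _ = sentence_image_score(phrase_emb, neg_regions)
    return F.relu(margin - pos_score + neg_score)

# Toy tensors standing in for encoder outputs (assumed, for illustration only).
torch.manual_seed(0)
phrase_emb = torch.randn(3, 256, requires_grad=True)   # 3 phrases in the caption
pos_regions = torch.randn(12, 256)                      # regions of the paired image
neg_regions = torch.randn(12, 256)                      # regions of a mismatched image
print(weakly_supervised_loss(phrase_emb, pos_regions, neg_regions))
```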
Phrase Grounding-based Style Transfer for Single-Domain Generalized Object Detection
Single-domain generalized object detection aims to enhance a model's generalizability to multiple unseen target domains using only data from a single source domain during training.
Enhancing the vision-language foundation model with key semantic knowledge-emphasized report refinement
In particular, raw radiology reports are refined to highlight key information according to a constructed clinical dictionary and two model-optimized knowledge-enhancement metrics.
Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications
We present Catalog Phrase Grounding (CPG), a model that associates product textual data (title, brand) with the corresponding regions of product images (isolated product region, brand logo region) for e-commerce vision-language applications.
Read, look and detect: Bounding box annotation from image-caption pairs
Various methods have been proposed to detect objects while reducing the cost of data annotation.
ELVIS: Empowering Locality of Vision Language Pre-training with Intra-modal Similarity
Deep learning has shown great potential in assisting radiologists in reading chest X-ray (CXR) images, but its need for expensive annotations to improve performance hinders widespread clinical application.
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Visual and linguistic pre-training aims to learn vision and language representations together, which can be transferred to visual-linguistic downstream tasks.