Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.
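As a rough illustration of the task's input/output contract (the caption, phrases, and boxes below are hypothetical, not from any paper's data):

```python
# Minimal sketch of the phrase grounding I/O contract (hypothetical data).
caption = "A man in a red shirt throws a frisbee to a dog."
noun_phrases = ["A man in a red shirt", "a frisbee", "a dog"]

# Candidate regions as (x1, y1, x2, y2) boxes, e.g. from a region proposal network.
regions = [(30, 40, 180, 400), (200, 120, 260, 170), (320, 250, 470, 390)]

# A grounding model maps each noun phrase to one region index.
grounding = {"A man in a red shirt": 0, "a frisbee": 1, "a dog": 2}
print(grounding)
```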
We propose a novel approach that learns grounding by reconstructing a given phrase through an attention mechanism; the attention over regions can either remain latent or be optimized directly.
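A minimal PyTorch sketch in the spirit of this reconstruction idea is shown below; the module names, dimensions, and the single-step word decoder are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class PhraseReconstructionGrounder(nn.Module):
    """Sketch: ground a phrase by attending over region features and
    reconstructing the phrase from the attended visual feature.
    Training the reconstruction loss shapes the latent attention,
    which at test time is read off as the grounding decision."""

    def __init__(self, vocab_size, d_text=256, d_vis=512, d_joint=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_text)
        self.phrase_enc = nn.LSTM(d_text, d_joint, batch_first=True)
        self.vis_proj = nn.Linear(d_vis, d_joint)
        self.score = nn.Linear(d_joint, 1)
        self.decoder = nn.Linear(d_joint, vocab_size)  # predicts phrase words

    def forward(self, phrase_tokens, region_feats):
        # phrase_tokens: (B, T) word ids; region_feats: (B, R, d_vis)
        _, (h, _) = self.phrase_enc(self.embed(phrase_tokens))
        q = h[-1]                                    # (B, d_joint) phrase query
        v = torch.tanh(self.vis_proj(region_feats))  # (B, R, d_joint)
        att = self.score(torch.tanh(v + q.unsqueeze(1))).squeeze(-1)  # (B, R)
        alpha = att.softmax(dim=-1)                  # latent attention = grounding
        attended = (alpha.unsqueeze(-1) * v).sum(1)  # (B, d_joint)
        logits = self.decoder(attended)              # reconstruct phrase words
        return alpha, logits
```

With ground-truth boxes, `alpha` can be supervised directly; without them, only the reconstruction logits receive a loss and the attention stays latent.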
Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image.
After applying dedicated non-linear mappings to the visual features at each level and to the word and sentence embeddings, we obtain multiple instantiations of our common semantic space, in which comparisons between any target text and the visual content are performed with cosine similarity.
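A minimal sketch of one such instantiation follows; the two-layer mappings and the layer sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpace(nn.Module):
    """Sketch: dedicated non-linear mappings project visual and textual
    features into one shared semantic space; matching then reduces to
    cosine similarity. Dimensions are illustrative assumptions."""

    def __init__(self, d_vis=2048, d_txt=300, d_common=512):
        super().__init__()
        self.vis_map = nn.Sequential(nn.Linear(d_vis, d_common), nn.ReLU(),
                                     nn.Linear(d_common, d_common))
        self.txt_map = nn.Sequential(nn.Linear(d_txt, d_common), nn.ReLU(),
                                     nn.Linear(d_common, d_common))

    def forward(self, region_feats, text_emb):
        # region_feats: (R, d_vis); text_emb: (d_txt,) word or sentence embedding
        v = F.normalize(self.vis_map(region_feats), dim=-1)  # (R, d_common)
        t = F.normalize(self.txt_map(text_emb), dim=-1)      # (d_common,)
        return v @ t  # cosine similarity of each region to the text
```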
Computer Vision applications often require a textual grounding module that is precise, interpretable, and resilient to counterfactual inputs/queries.
In this paper, we formulate phrase grounding as a sequence labeling task in which candidate regions are treated as potential labels, and use neural chain Conditional Random Fields (CRFs) to model dependencies among the regions assigned to adjacent mentions.
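Decoding such a chain CRF is standard Viterbi over the mention sequence; here is a small NumPy sketch, where the unary and pairwise scores are assumed to come from a learned model:

```python
import numpy as np

def viterbi_grounding(unary, pairwise):
    """Sketch of chain-CRF decoding for phrase grounding:
    mentions form the chain, candidate regions are the labels.
    unary:    (M, R) mention-to-region compatibility scores
    pairwise: (R, R) transition scores between the regions of
              adjacent mentions (assumed inputs from a scorer).
    Returns one region index per mention."""
    M, R = unary.shape
    score = unary[0].copy()
    back = np.zeros((M, R), dtype=int)
    for m in range(1, M):
        trans = score[:, None] + pairwise   # (R_prev, R_curr)
        back[m] = trans.argmax(axis=0)      # best previous region per current
        score = trans.max(axis=0) + unary[m]
    path = [int(score.argmax())]            # backtrack from the best end state
    for m in range(M - 1, 0, -1):
        path.append(int(back[m][path[-1]]))
    return path[::-1]
```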
To address these limitations, this paper proposes a language-guided graph representation that captures the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
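As a simplified illustration of what a cross-modal graph matching objective can look like (an assumption for exposition, not the paper's exact formulation), an assignment of phrases to regions can be scored by node similarity plus relation consistency:

```python
import numpy as np

def cross_modal_graph_score(node_sim, phrase_edges, region_edges, assignment):
    """Sketch of a simplified cross-modal graph matching score:
    reward the similarity of matched phrase/region pairs plus the
    consistency of relations between the matched pairs.
    node_sim:     (P, R) phrase-to-region similarities
    phrase_edges: iterable of (i, j) relations in the caption graph
    region_edges: (R, R) pairwise relation scores between regions
    assignment:   list mapping each phrase index to a region index."""
    node_term = sum(node_sim[p, assignment[p]] for p in range(len(assignment)))
    edge_term = sum(region_edges[assignment[i], assignment[j]]
                    for (i, j) in phrase_edges)
    return node_term + edge_term
```

Maximizing this score over assignments couples the phrases together, so a relation in the caption can veto region choices that would be locally plausible in isolation.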
Given pairs of images and captions, we maximize the compatibility between attention-weighted image regions and the words of the corresponding caption, relative to non-corresponding image-caption pairs.
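A hedged PyTorch sketch of such an objective follows, pairing a word-to-region attention score with a margin-based contrastive loss; the exact scoring and loss here are assumptions in the spirit of the description, not the paper's precise formulation:

```python
import torch
import torch.nn.functional as F

def attention_compatibility(words, regions):
    """Sketch: score an image-caption pair by letting each word attend
    over region features and averaging the cosine similarity between
    each word and its attended visual context.
    words: (T, d) word embeddings; regions: (R, d) region embeddings."""
    w = F.normalize(words, dim=-1)
    r = F.normalize(regions, dim=-1)
    att = (w @ r.t()).softmax(dim=-1)   # (T, R) word-to-region attention
    ctx = F.normalize(att @ r, dim=-1)  # (T, d) attended visual context
    return (w * ctx).sum(-1).mean()     # mean word-context cosine

def triplet_loss(words, regions, neg_words, neg_regions, margin=0.2):
    """Hinge loss: a corresponding image-caption pair should outscore
    non-corresponding pairs by at least the margin (assumed value)."""
    pos = attention_compatibility(words, regions)
    neg_c = attention_compatibility(neg_words, regions)   # mismatched caption
    neg_i = attention_compatibility(words, neg_regions)   # mismatched image
    return F.relu(margin + neg_c - pos) + F.relu(margin + neg_i - pos)
```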