Phrase Grounding

49 papers with code • 5 benchmarks • 6 datasets

Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.

Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
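
As a rough illustration of the task definition above, the sketch below represents one caption with its grounded noun phrases and includes an intersection-over-union (IoU) helper for comparing a predicted region with a reference region. The names `GroundedPhrase`, `find_phrase`, and `iou`, the example caption, and the box coordinates are all hypothetical. Scoring a prediction as correct when IoU is at least 0.5 is a common convention assumed here, not something the page states.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedPhrase:
    text: str   # noun phrase as it appears in the caption
    start: int  # character offset of the phrase within the caption
    box: Box    # image region the phrase is grounded to


def find_phrase(caption: str, phrase: str, box: Box) -> GroundedPhrase:
    """Attach a box to the first occurrence of `phrase` in `caption`."""
    start = caption.index(phrase)  # raises ValueError if the phrase is absent
    return GroundedPhrase(phrase, start, box)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: the caption and boxes are made up for illustration.
caption = "A man in a red hat walks a dog"
reference = find_phrase(caption, "a red hat", (40.0, 10.0, 80.0, 35.0))
predicted_box = (42.0, 12.0, 82.0, 37.0)

# Assumed convention: a prediction counts as correct when IoU >= 0.5.
is_correct = iou(predicted_box, reference.box) >= 0.5
```

Storing the character offset alongside the phrase text keeps repeated phrases in one caption distinguishable, although `find_phrase` as written only locates the first occurrence.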

Most implemented papers

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

akirafukui/vqa-mcb EMNLP 2016

Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations.
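
The three simple pooling schemes named in this sentence can be sketched on plain Python lists. This is a minimal illustration, not code from the `akirafukui/vqa-mcb` repository: the function names and the toy vectors are hypothetical, and the sketch covers only the baseline operations listed above, not the compact bilinear pooling of the paper's title.

```python
from typing import List

Vector = List[float]


def _check_same_length(visual: Vector, textual: Vector) -> None:
    # Element-wise operations are only defined for equal-length vectors.
    if len(visual) != len(textual):
        raise ValueError("visual and textual vectors must have the same length")


def elementwise_product(visual: Vector, textual: Vector) -> Vector:
    """Pool by multiplying matching coordinates."""
    _check_same_length(visual, textual)
    return [v * t for v, t in zip(visual, textual)]


def elementwise_sum(visual: Vector, textual: Vector) -> Vector:
    """Pool by adding matching coordinates."""
    _check_same_length(visual, textual)
    return [v + t for v, t in zip(visual, textual)]


def concatenate(visual: Vector, textual: Vector) -> Vector:
    """Pool by placing the textual vector after the visual one."""
    return list(visual) + list(textual)


# Toy vectors, made up for illustration.
visual_features = [1.0, 2.0, 3.0]
textual_features = [4.0, 5.0, 6.0]

pooled_product = elementwise_product(visual_features, textual_features)
pooled_sum = elementwise_sum(visual_features, textual_features)
pooled_concat = concatenate(visual_features, textual_features)
```

The product and sum keep the input dimensionality but require both representations to have the same length, while concatenation accepts different lengths and yields a vector whose length is the sum of the two.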

MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

ashkamath/mdetr 26 Apr 2021

We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.

Towards Visual Grounding: A Survey

linhuixiao/awesome-visual-grounding 28 Dec 2024

Finally, we outline the challenges confronting visual grounding and propose valuable directions for future research, which may serve as inspiration for subsequent researchers.

Grounding of Textual Phrases in Images by Reconstruction

akirafukui/vqa-mcb 12 Nov 2015

We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.
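
The attention step mentioned here can be sketched as a softmax over candidate regions. The sketch below is an assumption-laden illustration rather than the paper's implementation: the relevance scores are supplied by hand (in a trained model they would come from a learned function of the phrase and each region), the function names are hypothetical, and the phrase-reconstruction part of the approach is omitted entirely.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into attention weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    scores: Sequence[float],
    region_features: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float], int]:
    """Return attention weights, the attended feature, and the top region index."""
    if len(scores) != len(region_features):
        raise ValueError("need exactly one score per region")
    weights = softmax(scores)
    dim = len(region_features[0])
    # Attended feature: attention-weighted sum of the region features.
    attended = [
        sum(w * feat[i] for w, feat in zip(weights, region_features))
        for i in range(dim)
    ]
    # The region with the largest weight serves as the grounding prediction.
    best = max(range(len(weights)), key=lambda i: weights[i])
    return weights, attended, best


# Hypothetical scores and features for three candidate regions.
example_scores = [0.1, 2.0, -1.0]
example_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, attended, best_region = attend(example_scores, example_features)
```

Under these assumptions, the attended feature is what a reconstruction module would consume, and the index of the highest-weighted region is the grounding output.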

Revisiting Image-Language Networks for Open-ended Phrase Detection

BryanPlummer/phrase_detection 17 Nov 2018

Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image.

Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing

microsoft/hi-ml 21 Apr 2022

We release a new dataset with locally-aligned phrase grounding annotations by radiologists to facilitate the study of complex semantic modelling in biomedical vision-language processing.

Kosmos-2: Grounding Multimodal Large Language Models to the World

microsoft/unilm 26 Jun 2023

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.

An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

open-mmlab/mmdetection 4 Jan 2024

Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC).

Natural Language Object Retrieval

ronghanghu/natural-language-object-retrieval CVPR 2016

In this paper, we address the task of natural language object retrieval, to localize a target object within a given image based on a natural language query of the object.