Phrase Grounding

14 papers with code • 5 benchmarks • 3 datasets

Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.

Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
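As a rough illustration of how the task above is typically scored (not specific to any paper on this page), a predicted region is usually counted as correct when its intersection-over-union (IoU) with the annotated region is at least 0.5. A minimal sketch, with illustrative function names and `(x1, y1, x2, y2)` boxes assumed:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def grounding_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of phrases whose predicted box overlaps the gold box at IoU >= threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)
```

Benchmarks differ in details (some report recall over top-k candidate boxes), but IoU @ 0.5 is the common core.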

Most implemented papers

Grounding of Textual Phrases in Images by Reconstruction

akirafukui/vqa-mcb 12 Nov 2015

We propose a novel approach that learns grounding by reconstructing a given phrase through an attention mechanism, which can be either latent or optimized directly.
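The core ingredient here, soft attention over candidate regions given a phrase embedding, can be sketched as below. This is a generic illustration, not the authors' architecture; the function name and feature shapes are assumptions.

```python
import numpy as np

def attend(phrase_vec, region_feats):
    """Soft attention over candidate regions.

    phrase_vec:   (d,) embedding of the noun phrase.
    region_feats: (n_regions, d) features of candidate image regions.
    Returns attention weights and the attention-weighted region feature,
    which a reconstruction model would then decode back into the phrase.
    """
    scores = region_feats @ phrase_vec
    scores = scores - scores.max()      # numerical stability before softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()
    attended = weights @ region_feats   # convex combination of region features
    return weights, attended
```

Training the reconstruction end-to-end pushes the attention weights toward the region that actually depicts the phrase, which is what makes the grounding emerge without box supervision.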

Revisiting Image-Language Networks for Open-ended Phrase Detection

BryanPlummer/phrase_detection 17 Nov 2018

Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image.

Conditional Image-Text Embedding Networks

BryanPlummer/cite ECCV 2018

This paper presents an approach for grounding phrases in images which jointly learns multiple text-conditioned embeddings in a single end-to-end model.

Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding

hassanhub/MultiGrounding CVPR 2019

Following dedicated non-linear mappings for visual features at each level, word, and sentence embeddings, we obtain multiple instantiations of our common semantic space in which comparisons between any target text and the visual content are performed with cosine similarity.
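Once text and visual features live in a shared space, grounding reduces to picking the region most cosine-similar to the phrase. A minimal sketch of that final step (illustrative names; the paper's multi-level mappings are not reproduced here):

```python
import numpy as np

def cosine_ground(text_vec, region_feats):
    """Return the index of the region whose embedding has the highest
    cosine similarity to the text embedding, plus all similarities."""
    t = text_vec / np.linalg.norm(text_vec)
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    sims = r @ t
    return int(np.argmax(sims)), sims
```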

Modularized Textual Grounding for Counterfactual Resilience

jacobswan1/MTG-pytorch CVPR 2019

Computer Vision applications often require a textual grounding module with precision, interpretability, and resilience to counterfactual inputs/queries.

Zero-Shot Grounding of Objects from Natural Language Queries

TheShadow29/zsgnet-pytorch ICCV 2019

A phrase grounding system localizes a particular object in an image referred to by a natural language query.

Phrase Grounding by Soft-Label Chain Conditional Random Field

liujch1998/SoftLabelCCRF IJCNLP 2019

In this paper, we formulate phrase grounding as a sequence labeling task where we treat candidate regions as potential labels, and use neural chain Conditional Random Fields (CRFs) to model dependencies among regions for adjacent mentions.
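Treating regions as labels over a sequence of mentions means decoding can use standard linear-chain machinery. The sketch below is plain Viterbi MAP decoding, not the paper's soft-label variant, and the score shapes are assumptions:

```python
import numpy as np

def viterbi(unary, pairwise):
    """Viterbi decoding for a linear-chain model.

    unary:    (T, K) score of assigning mention t to candidate region k.
    pairwise: (K, K) transition score between regions of adjacent mentions.
    Returns the highest-scoring sequence of region indices.
    """
    T, K = unary.shape
    score = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + pairwise   # (K_prev, K_cur) combined scores
        back[t] = cand.argmax(axis=0)      # best predecessor per current label
        score = cand.max(axis=0) + unary[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 1 - 1, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

The pairwise term is what lets evidence for one mention's region influence its neighbors, which independent per-phrase classifiers cannot do.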

Learning Cross-modal Context Graph for Visual Grounding

youngfly11/LCMCG-PyTorch AAAI 2020

To address their limitations, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.

Contrastive Learning for Weakly Supervised Phrase Grounding

BigRedT/info-ground ECCV 2020

Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions.
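The matched-versus-mismatched objective described here is InfoNCE-style contrastive learning. A simplified sketch over whole image/caption vectors (the paper works with attention-weighted regions and words; names and the temperature value are assumptions):

```python
import numpy as np

def contrastive_loss(image_vecs, caption_vecs, temperature=0.1):
    """InfoNCE-style loss: the matched image-caption pair (same row index)
    should score higher than every mismatched pair in the batch."""
    img = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    cap = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    logits = img @ cap.T / temperature                 # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                # cross-entropy on matched pairs
```

Minimizing this loss pulls corresponding image-caption pairs together and pushes non-corresponding batch-mates apart, which is the supervision signal used in place of box annotations.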

MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding

qinzzz/Multimodal-Alignment-Framework EMNLP 2020

Phrase localization is a task that studies the mapping from textual phrases to regions of an image.