Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.
|TREND||DATASET||BEST METHOD||PAPER TITLE||PAPER||CODE||COMPARE|
We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.
Ranked #4 on Phrase Grounding on Flickr30k Entities Test
We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.
Ranked #1 on Visual Question Answering on CLEVR-Humans
A phrase grounding system localizes a particular object in an image referred to by a natural language query.
Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions.
Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image.
To address their limitations, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develop a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
Following dedicated non-linear mappings for visual features at each level, word, and sentence embeddings, we obtain multiple instantiations of our common semantic space in which comparisons between any target text and the visual content is performed with cosine similarity.
Ranked #1 on Phrase Grounding on Visual Genome
In this paper, we formulate phrase grounding as a sequence labeling task where we treat candidate regions as potential labels, and use neural chain Conditional Random Fields (CRFs) to model dependencies among regions for adjacent mentions.