Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.
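As a rough illustration of the task's input/output contract (the caption, phrases, and boxes below are hypothetical, not from any paper's data):

```python
# Minimal sketch of the phrase grounding I/O contract (hypothetical data).
caption = "A man in a red shirt throws a frisbee to a dog."
noun_phrases = ["A man in a red shirt", "a frisbee", "a dog"]

# Candidate regions as (x1, y1, x2, y2) boxes, e.g. from a region proposal network.
regions = [(30, 40, 180, 400), (200, 120, 260, 170), (320, 250, 470, 390)]

# A grounding model maps each noun phrase to one region index.
grounding = {"A man in a red shirt": 0, "a frisbee": 1, "a dog": 2}
print(grounding)
```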
We propose a novel approach that learns grounding by reconstructing a given phrase through an attention mechanism; the attention over regions can either remain latent or be optimized directly.
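A minimal PyTorch sketch in the spirit of this reconstruction idea is shown below; the module names, dimensions, and the single-step word decoder are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class PhraseReconstructionGrounder(nn.Module):
    """Sketch: ground a phrase by attending over region features and
    reconstructing the phrase from the attended visual feature.
    Training the reconstruction loss shapes the latent attention,
    which at test time is read off as the grounding decision."""

    def __init__(self, vocab_size, d_text=256, d_vis=512, d_joint=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_text)
        self.phrase_enc = nn.LSTM(d_text, d_joint, batch_first=True)
        self.vis_proj = nn.Linear(d_vis, d_joint)
        self.score = nn.Linear(d_joint, 1)
        self.decoder = nn.Linear(d_joint, vocab_size)  # predicts phrase words

    def forward(self, phrase_tokens, region_feats):
        # phrase_tokens: (B, T) word ids; region_feats: (B, R, d_vis)
        _, (h, _) = self.phrase_enc(self.embed(phrase_tokens))
        q = h[-1]                                    # (B, d_joint) phrase query
        v = torch.tanh(self.vis_proj(region_feats))  # (B, R, d_joint)
        att = self.score(torch.tanh(v + q.unsqueeze(1))).squeeze(-1)  # (B, R)
        alpha = att.softmax(dim=-1)                  # latent attention = grounding
        attended = (alpha.unsqueeze(-1) * v).sum(1)  # (B, d_joint)
        logits = self.decoder(attended)              # reconstruct phrase words
        return alpha, logits
```

With ground-truth boxes, `alpha` can be supervised directly; without them, only the reconstruction logits receive a loss and the attention stays latent.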
Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image.
After applying dedicated non-linear mappings to the visual features at each level and to the word and sentence embeddings, we obtain multiple instantiations of our common semantic space, in which comparisons between any target text and the visual content are performed with cosine similarity.
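A minimal sketch of one such instantiation follows; the two-layer mappings and the layer sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpace(nn.Module):
    """Sketch: dedicated non-linear mappings project visual and textual
    features into one shared semantic space; matching then reduces to
    cosine similarity. Dimensions are illustrative assumptions."""

    def __init__(self, d_vis=2048, d_txt=300, d_common=512):
        super().__init__()
        self.vis_map = nn.Sequential(nn.Linear(d_vis, d_common), nn.ReLU(),
                                     nn.Linear(d_common, d_common))
        self.txt_map = nn.Sequential(nn.Linear(d_txt, d_common), nn.ReLU(),
                                     nn.Linear(d_common, d_common))

    def forward(self, region_feats, text_emb):
        # region_feats: (R, d_vis); text_emb: (d_txt,) word or sentence embedding
        v = F.normalize(self.vis_map(region_feats), dim=-1)  # (R, d_common)
        t = F.normalize(self.txt_map(text_emb), dim=-1)      # (d_common,)
        return v @ t  # cosine similarity of each region to the text
```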
Computer Vision applications often require a textual grounding module that is precise, interpretable, and resilient to counterfactual inputs/queries.
In this paper, we formulate phrase grounding as a sequence labeling task in which candidate regions are treated as potential labels, and use neural chain Conditional Random Fields (CRFs) to model dependencies among the regions assigned to adjacent mentions.
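Decoding such a chain CRF is standard Viterbi over the mention sequence; here is a small NumPy sketch, where the unary and pairwise scores are assumed to come from a learned model:

```python
import numpy as np

def viterbi_grounding(unary, pairwise):
    """Sketch of chain-CRF decoding for phrase grounding:
    mentions form the chain, candidate regions are the labels.
    unary:    (M, R) mention-to-region compatibility scores
    pairwise: (R, R) transition scores between the regions of
              adjacent mentions (assumed inputs from a scorer).
    Returns one region index per mention."""
    M, R = unary.shape
    score = unary[0].copy()
    back = np.zeros((M, R), dtype=int)
    for m in range(1, M):
        trans = score[:, None] + pairwise   # (R_prev, R_curr)
        back[m] = trans.argmax(axis=0)      # best previous region per current
        score = trans.max(axis=0) + unary[m]
    path = [int(score.argmax())]            # backtrack from the best end state
    for m in range(M - 1, 0, -1):
        path.append(int(back[m][path[-1]]))
    return path[::-1]
```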
To address these limitations, this paper proposes a language-guided graph representation that captures the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
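As a simplified illustration of what a cross-modal graph matching objective can look like (an assumption for exposition, not the paper's exact formulation), an assignment of phrases to regions can be scored by node similarity plus relation consistency:

```python
import numpy as np

def cross_modal_graph_score(node_sim, phrase_edges, region_edges, assignment):
    """Sketch of a simplified cross-modal graph matching score:
    reward the similarity of matched phrase/region pairs plus the
    consistency of relations between the matched pairs.
    node_sim:     (P, R) phrase-to-region similarities
    phrase_edges: iterable of (i, j) relations in the caption graph
    region_edges: (R, R) pairwise relation scores between regions
    assignment:   list mapping each phrase index to a region index."""
    node_term = sum(node_sim[p, assignment[p]] for p in range(len(assignment)))
    edge_term = sum(region_edges[assignment[i], assignment[j]]
                    for (i, j) in phrase_edges)
    return node_term + edge_term
```

Maximizing this score over assignments couples the phrases together, so a relation in the caption can veto region choices that would be locally plausible in isolation.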
Given pairs of images and captions, we maximize the compatibility between attention-weighted image regions and the words of the corresponding caption, relative to non-corresponding image-caption pairs.
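A hedged PyTorch sketch of such an objective follows, pairing a word-to-region attention score with a margin-based contrastive loss; the exact scoring and loss here are assumptions in the spirit of the description, not the paper's precise formulation:

```python
import torch
import torch.nn.functional as F

def attention_compatibility(words, regions):
    """Sketch: score an image-caption pair by letting each word attend
    over region features and averaging the cosine similarity between
    each word and its attended visual context.
    words: (T, d) word embeddings; regions: (R, d) region embeddings."""
    w = F.normalize(words, dim=-1)
    r = F.normalize(regions, dim=-1)
    att = (w @ r.t()).softmax(dim=-1)   # (T, R) word-to-region attention
    ctx = F.normalize(att @ r, dim=-1)  # (T, d) attended visual context
    return (w * ctx).sum(-1).mean()     # mean word-context cosine

def triplet_loss(words, regions, neg_words, neg_regions, margin=0.2):
    """Hinge loss: a corresponding image-caption pair should outscore
    non-corresponding pairs by at least the margin (assumed value)."""
    pos = attention_compatibility(words, regions)
    neg_c = attention_compatibility(neg_words, regions)   # mismatched caption
    neg_i = attention_compatibility(words, neg_regions)   # mismatched image
    return F.relu(margin + neg_c - pos) + F.relu(margin + neg_i - pos)
```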