A phrase grounding system localizes a particular object in an image referred to by a natural language query.
Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image.
After applying dedicated non-linear mappings to the visual features at each level and to the word and sentence embeddings, we obtain multiple instantiations of our common semantic space, in which comparisons between any target text and the visual content are performed with cosine similarity.
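A minimal sketch of this idea, not the paper's code: two hypothetical MLPs (`visual_mlp`, `text_mlp`, with assumed feature dimensions) project region features and a phrase embedding into one shared space, and regions are ranked by cosine similarity to the phrase.

```python
# Hedged sketch: map both modalities into a common space, score with cosine
# similarity. Module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpace(nn.Module):
    """Hypothetical non-linear mappings into one shared semantic space."""
    def __init__(self, vis_dim=2048, txt_dim=300, common_dim=512):
        super().__init__()
        self.visual_mlp = nn.Sequential(nn.Linear(vis_dim, common_dim), nn.ReLU(),
                                        nn.Linear(common_dim, common_dim))
        self.text_mlp = nn.Sequential(nn.Linear(txt_dim, common_dim), nn.ReLU(),
                                      nn.Linear(common_dim, common_dim))

    def forward(self, region_feats, phrase_emb):
        # region_feats: (num_regions, vis_dim); phrase_emb: (txt_dim,)
        v = F.normalize(self.visual_mlp(region_feats), dim=-1)
        t = F.normalize(self.text_mlp(phrase_emb), dim=-1)
        return v @ t  # cosine similarity per region, shape (num_regions,)

model = CommonSpace()
scores = model(torch.randn(36, 2048), torch.randn(300))
best_region = scores.argmax().item()  # region best matching the phrase
```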
SOTA for Phrase Grounding on Visual Genome
To address the limitations of prior approaches, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
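The following is an illustrative sketch only, not the paper's method: it scores an assignment of phrase-graph nodes to region-graph nodes by node similarity plus a bonus when a matched pair of phrases preserves an edge in the region graph. The brute-force search, `edge_weight`, and the toy graphs are all assumptions; a real system would use a learned or optimized matcher.

```python
# Hedged sketch of cross-modal graph matching via node affinity plus
# edge-consistency bonuses. Brute-force search is for illustration only.
import itertools
import numpy as np

def match_graphs(node_sim, phrase_edges, region_edges, edge_weight=0.5):
    """node_sim[i, j]: similarity of phrase node i and region node j.
    phrase_edges / region_edges: sets of directed node-index pairs."""
    num_phrases, num_regions = node_sim.shape
    best_score, best_assign = -np.inf, None
    for assign in itertools.permutations(range(num_regions), num_phrases):
        # node-level agreement
        score = sum(node_sim[i, assign[i]] for i in range(num_phrases))
        # reward phrase relations that also hold between matched regions
        score += edge_weight * sum(
            1 for (i, k) in phrase_edges
            if (assign[i], assign[k]) in region_edges)
        if score > best_score:
            best_score, best_assign = score, assign
    return best_assign, best_score

sim = np.random.rand(2, 4)                  # 2 phrases, 4 candidate regions
assign, score = match_graphs(sim, {(0, 1)}, {(0, 1), (2, 3)})
```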
In this paper, we formulate phrase grounding as a sequence labeling task where we treat candidate regions as potential labels, and use neural chain Conditional Random Fields (CRFs) to model dependencies among regions for adjacent mentions.
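As a hedged sketch of the decoding step this formulation implies (not the authors' implementation): with candidate regions as the label set, the highest-scoring region sequence under a chain CRF can be recovered by Viterbi decoding. The unary and pairwise score matrices here are assumed to come from some upstream model.

```python
# Hedged sketch: Viterbi decoding over a chain of mentions whose labels
# are candidate regions; scores are placeholder inputs, not learned ones.
import numpy as np

def viterbi(unary, pairwise):
    """unary: (num_mentions, num_regions) per-mention region scores.
    pairwise: (num_regions, num_regions) score for adjacent-mention pairs."""
    T, R = unary.shape
    score = unary[0].copy()
    backptr = np.zeros((T, R), dtype=int)
    for t in range(1, T):
        # best previous region for each current region
        trans = score[:, None] + pairwise          # (R_prev, R_curr)
        backptr[t] = trans.argmax(axis=0)
        score = trans.max(axis=0) + unary[t]
    # backtrack the highest-scoring region sequence
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

regions = viterbi(np.random.rand(3, 5), np.random.rand(5, 5))
```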