Grounding Referring Expressions in Images by Variational Context

CVPR 2018 · Hanwang Zhang, Yulei Niu, Shih-Fu Chang

We focus on grounding (i.e., localizing or linking) referring expressions in images, e.g., "largest elephant standing behind baby elephant". This is a general yet challenging vision-language task, since it requires not only the localization of objects but also the multimodal comprehension of context: visual attributes (e.g., "largest", "baby") and relationships (e.g., "behind") that help to distinguish the referent from other objects, especially those of the same category...

