Person-centric Visual Grounding
4 papers with code • 1 benchmark • 1 dataset
Person-centric visual grounding is the problem of linking people named in a caption to the people pictured in an image. The task was introduced in "Who's Waldo? Linking People Across Text and Images" (Cui et al., ICCV 2021).
We present a task and benchmark dataset for person-centric visual grounding, the problem of linking between people named in a caption and people pictured in an image.
We find that the original Who's Waldo dataset compiled for this task contains a large number of biased samples that are solvable simply by heuristic methods; for instance, in many cases the first name in the sentence corresponds to the largest bounding box, or the sequence of names in the sentence corresponds to an exact left-to-right order in the image.
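The two heuristics mentioned above can be illustrated with a minimal sketch. This is not code from any of the listed papers: the box format `(x_min, y_min, x_max, y_max)`, the function names, and the toy data are all assumptions made for illustration, and the names are assumed to be given in the order they appear in the caption.

```python
from typing import Dict, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes get area 0."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def first_name_to_largest_box(names: List[str], boxes: List[Box]) -> Dict[str, int]:
    """Heuristic 1: link the first name in the caption to the largest box."""
    if not names or not boxes:
        return {}
    largest = max(range(len(boxes)), key=lambda i: box_area(boxes[i]))
    return {names[0]: largest}


def names_left_to_right(names: List[str], boxes: List[Box]) -> Dict[str, int]:
    """Heuristic 2: link names, in caption order, to boxes sorted left to right."""
    # Sort box indices by horizontal center.
    order = sorted(range(len(boxes)), key=lambda i: (boxes[i][0] + boxes[i][2]) / 2)
    # zip stops at the shorter sequence, so extra names or boxes stay unlinked.
    return dict(zip(names, order))


# Toy example (hypothetical data): two names, two detected people.
names = ["Alice", "Bob"]
boxes = [(200, 10, 260, 120), (20, 5, 140, 200)]

largest_guess = first_name_to_largest_box(names, boxes)
ordered_guess = names_left_to_right(names, boxes)
print(largest_guess)
print(ordered_guess)
```

Neither function looks at image content or at the caption beyond name order, so a sample that either one answers correctly does not test whether a model actually grounds text in the image.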