Person-centric Visual Grounding

4 papers with code • 1 benchmark • 1 dataset

Person-centric visual grounding is the problem of linking people named in a caption to the people pictured in an image. The task was introduced in "Who's Waldo? Linking People Across Text and Images" (Cui et al., ICCV 2021).

Most implemented papers

Who's Waldo? Linking People Across Text and Images

clairecyq/whos-waldo ICCV 2021

We present a task and benchmark dataset for person-centric visual grounding, the problem of linking between people named in a caption and people pictured in an image.

TubeDETR: Spatio-Temporal Video Grounding with Transformers

antoyang/TubeDETR 30 Mar 2022

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.

To Find Waldo You Need Contextual Cues: Debiasing Who's Waldo

fpsluozi/tofindwaldo 30 Mar 2022

We find that the original Who's Waldo dataset compiled for this task contains a large number of biased samples that are solvable simply by heuristic methods; for instance, in many cases the first name in the sentence corresponds to the largest bounding box, or the sequence of names in the sentence corresponds to an exact left-to-right order in the image.
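
The two heuristics named in this abstract can be illustrated with a minimal Python sketch. The data layout here is an assumption made for illustration, not the dataset's actual format: `boxes` is a list of `(x_min, y_min, x_max, y_max)` tuples, and `links[i]` is the index of the ground-truth box for the i-th name in the caption. The function names are hypothetical and are not taken from the paper's code.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have area 0."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def first_name_is_largest_box(boxes: List[Box], links: List[int]) -> bool:
    """True if the first-mentioned name is linked to the largest box."""
    if not boxes or not links:
        return False
    largest = max(range(len(boxes)), key=lambda j: box_area(boxes[j]))
    return links[0] == largest


def names_follow_left_to_right(boxes: List[Box], links: List[int]) -> bool:
    """True if the linked boxes appear left to right in caption order.

    Horizontal box centers are compared; a single name is treated as
    trivially ordered in this sketch.
    """
    centers = [(boxes[j][0] + boxes[j][2]) / 2 for j in links]
    return all(a < b for a, b in zip(centers, centers[1:]))


# Toy sample: three people, named in the caption in left-to-right order.
boxes = [(10, 20, 110, 220), (150, 40, 200, 140), (220, 30, 260, 110)]
easy_links = [0, 1, 2]   # solvable by both heuristics
hard_links = [2, 0]      # first name is the rightmost, smallest person

print(first_name_is_largest_box(boxes, easy_links))   # True
print(names_follow_left_to_right(boxes, easy_links))  # True
print(first_name_is_largest_box(boxes, hard_links))   # False
print(names_follow_left_to_right(boxes, hard_links))  # False
```

A sample for which either check returns True can be solved without reading the caption's context, which is the kind of bias the abstract describes. How the authors actually identify such samples is not stated here, so this should be read only as one plausible way to express the two checks.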