Charades-STA is a new dataset built on top of Charades by adding sentence temporal annotations.
184 PAPERS • 4 BENCHMARKS
A dataset for investigating three temporal localization tasks: supervised and weakly supervised audio-visual event localization, and cross-modality localization.
84 PAPERS • NO BENCHMARKS YET
VidSTG is a spatio-temporal video grounding dataset built on the video relation dataset VidOR. VidOR contains 7,000, 835, and 2,165 videos for training, validation, and testing, respectively. The goal of the Spatio-Temporal Video Grounding (STVG) task is to localize the spatio-temporal section of an untrimmed video that matches a given sentence depicting an object.
20 PAPERS • 1 BENCHMARK