The goal of this dataset is to probe video-language models' understanding of simple temporal relations such as "before" and "after". It is intended only as an evaluation set, not a training set.
2 PAPERS • 1 BENCHMARK
VTC is a large-scale multimodal dataset of roughly 300k video-caption pairs, each accompanied by user comments, and can be used for multimodal representation learning.
2 PAPERS • NO BENCHMARKS YET
Contains 5,193 video summaries of popular movies and TV series. SyMoN captures naturalistic storytelling videos made by human creators for human audiences, and offers higher story coverage and more frequent mental-state references than comparable video-language story datasets.
3 PAPERS • NO BENCHMARKS YET