9 dataset results for Video Understanding AND English

SoccerNet-Echoes

SoccerNet-Echoes (SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset)

SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset.

1 PAPER • NO BENCHMARKS YET

Vript (🎬 Vript: Refine Video Captioning into Video Scripting)

We construct a fine-grained video-text dataset with 12K annotated high-resolution videos (~400k clips). The annotation of this dataset is inspired by the video script. If we want to make a video, we have to first write a script to organize how to shoot the scenes in the videos. To shoot a scene, we need to decide the content, shot type (medium shot, close-up, etc), and how the camera moves (panning, tilting, etc). Therefore, we extend video captioning to video scripting by annotating the videos in the format of video scripts. Different from the previous video-text datasets, we densely annotate the entire videos without discarding any scenes and each scene has a caption with ~145 words. Besides the vision modality, we transcribe the voice-over into text and put it along with the video title to give more background information for annotating the videos.

0 PAPER • NO BENCHMARKS YET

VTC (Videos, Titles and Comments)

VTC is a large-scale multimodal dataset containing video-caption pairs (~300k) alongside comments that can be used for multimodal representation learning.

2 PAPERS • NO BENCHMARKS YET

Kinetics-GEB+

Kinetics-GEB+ (Generic Event Boundary Captioning, Grounding and Retrieval) is a dataset that consists of over 170k boundaries associated with captions describing status changes in the generic events in 12K videos.

1 PAPER • 3 BENCHMARKS

ImageCoDe (Image Retrieval from Contextual Descriptions)

Given 10 minimally contrastive (highly similar) images and a complex description for one of them, the task is to retrieve the correct image. The source of most images are videos and descriptions as well as retrievals come from human.

9 PAPERS • 1 BENCHMARK

STAR Benchmark (Situated Reasoning)

How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. STAR Benchmark is a novel benchmark for Situated Reasoning, which provides 60K challenging situated questions in four types of tasks, 140K situated hypergraphs, symbolic situation programs, and logic-grounded diagnosis for real-world video situations. (Data Download, STAR Leaderboard)

13 PAPERS • 3 BENCHMARKS

C3D features for PHD2GIF

The feature files are named with the youtube IDs. https://drive.google.com/drive/folders/10-6hkQxMKMGwLXANxfPRE7xw5PKiMjLn?usp=sharing

1 PAPER • NO BENCHMARKS YET

VidSitu

VidSitu is a dataset for the task of semantic role labeling in videos (VidSRL). It is a large-scale video understanding data source with 29K 10-second movie clips richly annotated with a verb and semantic-roles every 2 seconds. Entities are co-referenced across events within a movie clip and events are connected to each other via event-event relations. Clips in VidSitu are drawn from a large collection of movies (∼3K) and have been chosen to be both complex (∼4.2 unique verbs within a video) as well as diverse (∼200 verbs have more than 100 annotations each).

13 PAPERS • NO BENCHMARKS YET

ShanghaiTech Campus

The ShanghaiTech Campus dataset has 13 scenes with complex light conditions and camera angles. It contains 130 abnormal events and over 270, 000 training frames. Moreover, both the frame-level and pixel-level ground truth of abnormal events are annotated in this dataset.

167 PAPERS • 5 BENCHMARKS

Datasets

9 dataset results for Video Understanding AND English