3 dataset results for segmentation AND Video Captioning AND Videos

…Mechanical Turk (AMT) is used to collect annotations on HowTo100M videos. 30k 60-second clips are randomly sampled from 9,421 videos, and each clip is presented to the turkers, who are asked to select a video segment. After this segment selection step, another group of workers is asked to write descriptions for each displayed segment. The final video segments are 10-20 seconds long on average, and the length of queries ranges from 8 to 20 words.

9 PAPERS • NO BENCHMARKS YET

ViTT (Video Timeline Tags)

The ViTT dataset consists of human-produced segment-level annotations for 8,169 videos. Of these, 5,840 videos have been annotated once, and the rest have been annotated twice or more.

11 PAPERS • 2 BENCHMARKS

How2QA

…Each worker is assigned one video segment and asked to write one question with four answer candidates (one correct and three distractors).

22 PAPERS • 2 BENCHMARKS
