…Mechanical Turk (AMT) is used to collect annotations on HowTo100M videos. 30k 60-second clips are randomly sampled from 9,421 videos, and each clip is presented to the turkers, who are asked to select a video segment. After this segment selection step, another group of workers is asked to write descriptions for each displayed segment. These final video segments are 10-20 seconds long on average, and the length of queries ranges from 8 to 20 words.
9 PAPERS • NO BENCHMARKS YET
The ViTT dataset consists of human-produced segment-level annotations for 8,169 videos. Of these, 5,840 videos have been annotated once, and the rest have been annotated twice or more.
11 PAPERS • 2 BENCHMARKS
…Each worker is assigned one video segment and asked to write one question with four answer candidates (one correct and three distractors).
22 PAPERS • 2 BENCHMARKS