To collect How2QA for video QA task, the same set of selected video clips are presented to another group of AMT workers for multichoice QA annotation. Each worker is assigned with one video segment and asked to write one question with four answer candidates (one correctand three distractors). Similarly, narrations are hidden from the workers to ensure the collected QA pairs are not biased by subtitles. Similar to TVQA, the start and end points are provided for the relevant moment for each question. After filtering low-quality annotations, the final dataset contains 44,007 QA pairs for 22k 60-second clips selected from 9035 videos.

Source: HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training


Paper Code Results Date Stars

Dataset Loaders


Similar Datasets


  • Unknown