WildQA: In-the-Wild Video Question Answering

14 Sep 2022 · Santiago Castro, Naihao Deng, Pingxuan Huang, Mihai Burzo, Rada Mihalcea

Existing video understanding datasets mostly focus on human interactions, with little attention paid to "in the wild" settings, where videos are recorded outdoors. We propose WILDQA, a video understanding dataset of videos recorded in outdoor settings. In addition to video question answering (Video QA), we also introduce the new task of identifying visual support for a given question and answer (Video Evidence Selection). Through evaluations using a wide range of baseline models, we show that WILDQA poses new challenges to the vision and language research communities. The dataset is available at https://lit.eecs.umich.edu/wildqa/.
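To make the two tasks concrete, below is a minimal, hypothetical illustration of what a single example involves. The field names and values are assumptions made for exposition only, not the dataset's actual schema.

```python
# A hypothetical WildQA-style example, for illustration only.
# Field names and values are assumptions, not the dataset's real schema.
example = {
    "video_id": "some_outdoor_clip",  # hypothetical identifier
    "question": "What are the people in the video doing?",
    "answer": "They are clearing brush near a trail.",
    # Video Evidence Selection: time spans (in seconds) in the video
    # that visually support the question-answer pair.
    "evidence_spans": [(12.0, 18.5), (40.0, 44.0)],
}

# Task 1 (Video QA): given the video and question, generate the answer.
# Task 2 (Video Evidence Selection): given the video, question, and
# answer, predict the supporting spans.
```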


Datasets


Introduced in the Paper:

WildQA

Used in the Paper:

TVQA, MovieQA, TVQA+, TutorialVQA

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Video Question Answering | WildQA | Multi (text + video, SE) | ROUGE-1 | 33.8 ± 0.8 | # 2 |
| Video Question Answering | WildQA | Multi (text + video, SE) | ROUGE-2 | 18.5 ± 0.7 | # 2 |
| Video Question Answering | WildQA | Multi (text + video, SE) | ROUGE-L | 32.5 ± 0.8 | # 2 |
| Video Question Answering | WildQA | Multi (text + video, IO) | ROUGE-1 | 34.0 ± 0.5 | # 1 |
| Video Question Answering | WildQA | Multi (text + video, IO) | ROUGE-2 | 18.8 ± 0.7 | # 1 |
| Video Question Answering | WildQA | Multi (text + video, IO) | ROUGE-L | 32.8 ± 0.6 | # 1 |
| Video Question Answering | WildQA | T5 (text + video) | ROUGE-1 | 33.1 ± 0.3 | # 4 |
| Video Question Answering | WildQA | T5 (text + video) | ROUGE-2 | 17.3 ± 0.4 | # 4 |
| Video Question Answering | WildQA | T5 (text + video) | ROUGE-L | 31.9 ± 0.2 | # 4 |
| Video Question Answering | WildQA | T5 (text) | ROUGE-1 | 33.8 ± 0.2 | # 2 |
| Video Question Answering | WildQA | T5 (text) | ROUGE-2 | 17.7 ± 0.1 | # 3 |
| Video Question Answering | WildQA | T5 (text) | ROUGE-L | 32.4 ± 0.3 | # 3 |
| Video Question Answering | WildQA | T5 (text, zero-shot) | ROUGE-1 | 0.8 ± 0.0 | # 5 |
| Video Question Answering | WildQA | T5 (text, zero-shot) | ROUGE-2 | 0.0 ± 0.0 | # 5 |
| Video Question Answering | WildQA | T5 (text, zero-shot) | ROUGE-L | 0.8 ± 0.0 | # 5 |

Methods


No methods listed for this paper.