Video Question Answering

67 papers with code • 11 benchmarks • 24 datasets

Video Question Answering (VideoQA) is the task of answering natural language questions about videos: given a video and a question, the model must produce an accurate answer grounded in the video's content.
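As a rough illustration of the task interface, the sketch below scores answer candidates for a video/question pair in the multiple-choice setting used by benchmarks such as TVQA. Everything here is a toy stand-in: `VideoQAExample`, `answer`, and the overlap-based scorer are hypothetical names invented for this example, not part of any real VideoQA codebase.

```python
from dataclasses import dataclass

# Hypothetical types for illustration only; real VideoQA pipelines differ widely.
@dataclass
class VideoQAExample:
    frames: list           # e.g. a list of decoded video frames
    question: str
    answer_choices: list   # multiple-choice setting, as in TVQA

def answer(example: VideoQAExample, score_fn) -> str:
    """Return the answer choice the model scores highest for this video/question."""
    scores = [score_fn(example.frames, example.question, choice)
              for choice in example.answer_choices]
    best = max(range(len(scores)), key=scores.__getitem__)
    return example.answer_choices[best]

# Toy scorer: question/answer word overlap, standing in for a learned model
# that would also attend to the video frames.
def toy_score(frames, question, answer_choice):
    return len(set(question.lower().split()) & set(answer_choice.lower().split()))

ex = VideoQAExample(frames=[],
                    question="What color is the car?",
                    answer_choices=["The car is red.", "A dog barks."])
print(answer(ex, toy_score))  # prints "The car is red."
```

A real system replaces `toy_score` with a network that fuses video and question features; the surrounding selection logic stays essentially this simple.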

Most implemented papers

Is Space-Time Attention All You Need for Video Understanding?

facebookresearch/TimeSformer 9 Feb 2021

We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
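The core idea of TimeSformer's "divided" space-time attention can be sketched in a few lines: temporal attention (each patch position attends across frames) followed by spatial attention (patches within each frame attend to one another). This is a minimal numpy sketch of the factorization only; the real block also has learned Q/K/V projections, multiple heads, residual connections, and LayerNorm, all omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (..., N, D). Single head with identity Q/K/V projections for brevity.
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def divided_space_time_attention(tokens):
    # tokens: (T, S, D) -- T frames, S patches per frame, D channels.
    # 1) Temporal attention: each patch position attends across the T frames.
    t = self_attention(np.swapaxes(tokens, 0, 1))  # (S, T, D)
    t = np.swapaxes(t, 0, 1)                       # back to (T, S, D)
    # 2) Spatial attention: each frame's S patches attend to one another.
    return self_attention(t)                       # (T, S, D)

x = np.random.default_rng(0).normal(size=(4, 16, 8))  # 4 frames, 16 patches, 8-dim
y = divided_space_time_attention(x)
print(y.shape)  # (4, 16, 8)
```

Factorizing attention this way keeps the cost linear in T·S per axis rather than quadratic in the full T·S token count, which is what makes the convolution-free design tractable on video.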

TVQA: Localized, Compositional Video Question Answering

jayleicn/TVQA EMNLP 2018

Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks.

TVQA+: Spatio-Temporal Grounding for Video Question Answering

jayleicn/TVQAplus ACL 2020

We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

linjieli222/HERO EMNLP 2020

We present HERO, a novel framework for large-scale video+language omni-representation learning.

A Joint Sequence Fusion Model for Video Question Answering and Retrieval

antoine77340/howto100m ECCV 2018

We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequences (e.g., a video clip and a language sentence).
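The simplest baseline for this kind of sequence-pair similarity is to pool each sequence into a single vector and compare them, as sketched below. This is only a stand-in to make the task concrete: JSFusion itself learns a hierarchical attention over frame-word pairs rather than mean-pooling, and the feature shapes here are invented for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sequence_similarity(video_feats, text_feats):
    # video_feats: (Tv, D) per-frame features; text_feats: (Tw, D) per-word features.
    # Mean-pool each sequence into one vector, then compare. JSFusion instead
    # fuses the two sequences with a learned hierarchical attention.
    return cosine_similarity(video_feats.mean(axis=0), text_feats.mean(axis=0))

rng = np.random.default_rng(1)
v = rng.normal(size=(30, 64))   # 30 frames of 64-d features (toy values)
s = rng.normal(size=(7, 64))    # 7 word embeddings (toy values)
score = sequence_similarity(v, s)
print(-1.0 <= score <= 1.0)     # cosine similarity lies in [-1, 1]
```

Ranking candidate videos (or sentences) by such a score is the retrieval setting the paper addresses; the same fused representation also supports the question-answering setting.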

OmniNet: A unified architecture for multi-modal multi-task learning

subho406/OmniNet 17 Jul 2019

We also show that using this neural network pre-trained on some modalities assists in learning unseen tasks such as video captioning and video question answering.

A Better Way to Attend: Attention with Trees for Video Question Answering

xuehy/TreeAttention 5 Sep 2019

We propose a new attention model for video question answering.

TutorialVQA: Question Answering Dataset for Tutorial Videos

acolas1/TutorialVQAData LREC 2020

The results indicate that the task is challenging and call for the investigation of new algorithms.

SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events


In this paper, we create a novel dataset, SUTD-TrafficQA (Traffic Question Answering), which takes the form of video QA based on the collected 10,080 in-the-wild videos and annotated 62,535 QA pairs, for benchmarking the cognitive capability of causal inference and event understanding models in complex traffic scenarios.