Video Question Answering

39 papers with code • 6 benchmarks • 17 datasets

Video question answering (VideoQA) is the task of answering natural language questions about video content, which requires associating dynamic visual information with linguistic concepts.

Greatest papers with code

Is Space-Time Attention All You Need for Video Understanding?

open-mmlab/mmaction2 9 Feb 2021

We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
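To make the idea concrete, below is a minimal PyTorch sketch of divided space-time attention in the spirit of this paper, not the mmaction2 implementation; the `SpaceTimeBlock` name, dimensions, and pre-norm layout are illustrative assumptions.

```python
# Sketch of divided space-time self-attention (illustrative, not TimeSformer's code).
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, patches, dim) -- patch embeddings of a video
        b, t, p, d = x.shape
        # Temporal attention: each patch position attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        a, _ = self.temporal_attn(self.norm1(xt), self.norm1(xt), self.norm1(xt))
        x = x + a.reshape(b, p, t, d).permute(0, 2, 1, 3)
        # Spatial attention: each frame's patches attend to one another.
        xs = x.reshape(b * t, p, d)
        a, _ = self.spatial_attn(self.norm2(xs), self.norm2(xs), self.norm2(xs))
        return x + a.reshape(b, t, p, d)
```

Factoring attention into a temporal pass followed by a spatial pass keeps the cost linear in frames times patches, rather than quadratic in their product as joint space-time attention would be.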

Action Classification • Action Recognition • +3

OmniNet: A unified architecture for multi-modal multi-task learning

subho406/OmniNet 17 Jul 2019

We also show that using this neural network pre-trained on some modalities assists in learning unseen tasks such as video captioning and video question answering.

Image Captioning • Language understanding • +6

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

jayleicn/ClipBERT CVPR 2021

Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.

Ranked #2 on Visual Question Answering on MSRVTT-QA (using extra training data)
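A hedged sketch of the sparse-sampling idea follows, assuming a generic joint vision-language `encoder` that scores a (clip, question) pair; it illustrates the training/inference recipe, not the ClipBERT codebase.

```python
# Sparse clip sampling with late fusion (illustrative assumptions throughout).
import random
import torch

def sample_sparse_clips(num_frames, num_clips=4, clip_len=2):
    """Pick `num_clips` short, randomly placed clips (lists of frame indices)."""
    clips = []
    for _ in range(num_clips):
        start = random.randint(0, max(0, num_frames - clip_len))
        clips.append(list(range(start, start + clip_len)))
    return clips

def predict(video_frames, question, encoder):
    # encoder: any joint vision-language model returning answer logits for a
    # (clip, question) pair -- an assumption for this sketch, not ClipBERT's API.
    logits = [encoder(video_frames[clip], question)
              for clip in sample_sparse_clips(len(video_frames))]
    return torch.stack(logits).mean(dim=0)  # late fusion across sampled clips
```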

Question Answering • Video Question Answering • +2

A Joint Sequence Fusion Model for Video Question Answering and Retrieval

antoine77340/howto100m ECCV 2018

We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pairs of multimodal sequence data (e.g., a video clip and a language sentence).
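As a baseline illustration of scoring video-sentence similarity, here is a minimal sketch using generic GRU encoders and cosine similarity; JSFusion's actual hierarchical joint fusion is considerably richer, and all names and dimensions below are assumptions.

```python
# Generic cross-modal similarity scoring (not JSFusion's fusion mechanism).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalScorer(nn.Module):
    def __init__(self, video_dim=2048, text_dim=300, shared_dim=512):
        super().__init__()
        self.video_enc = nn.GRU(video_dim, shared_dim, batch_first=True)
        self.text_enc = nn.GRU(text_dim, shared_dim, batch_first=True)

    def forward(self, video_feats, text_feats):
        # video_feats: (batch, frames, video_dim); text_feats: (batch, words, text_dim)
        _, v = self.video_enc(video_feats)  # final hidden state per video
        _, t = self.text_enc(text_feats)    # final hidden state per sentence
        v, t = F.normalize(v[-1], dim=-1), F.normalize(t[-1], dim=-1)
        return (v * t).sum(dim=-1)          # cosine similarity per pair
```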

Question Answering • Semantic Similarity • +4

TVQA: Localized, Compositional Video Question Answering

jayleicn/TVQA EMNLP 2018

Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks.

Video Question Answering

Hierarchical Conditional Relation Networks for Video Question Answering

thaolmk54/hcrn-videoqa CVPR 2020

Video question answering (VideoQA) is challenging as it requires modeling capacity to distill dynamic visual artifacts and distant relations and to associate them with linguistic concepts.
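Below is a relation-network-style sketch of question-conditioned relations among clip features, in the spirit of (but not taken from) HCRN; `ConditionalRelationUnit` and its dimensions are assumptions.

```python
# Pairwise clip relations conditioned on the question (illustrative sketch).
import torch
import torch.nn as nn

class ConditionalRelationUnit(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                               nn.Linear(dim, dim))

    def forward(self, clips, question):
        # clips: (n, dim) features for n clips; question: (dim,) conditioning vector
        n, d = clips.shape
        q = question.expand(n, n, d)
        pairs = torch.cat([clips.unsqueeze(1).expand(n, n, d),
                           clips.unsqueeze(0).expand(n, n, d), q], dim=-1)
        # Aggregate all question-conditioned pairwise relations into one vector.
        return self.g(pairs).mean(dim=(0, 1))
```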

Question Answering • Video Question Answering • +1

TVQA+: Spatio-Temporal Grounding for Video Question Answering

jayleicn/TVQA-PLUS ACL 2020

We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.
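To show what this task definition asks a system to produce, here is a hypothetical output structure combining the answer, the supporting temporal moment, and the grounded boxes; the field names are assumptions, not the TVQA+ schema.

```python
# Hypothetical container for a spatio-temporally grounded answer.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class GroundedAnswer:
    answer_idx: int                  # chosen multiple-choice answer
    span: Tuple[float, float]        # (start_sec, end_sec) of the relevant moment
    boxes: List[Tuple[int, float, float, float, float]] = field(default_factory=list)
    # each box: (frame_idx, x1, y1, x2, y2) for a referenced person or object
```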

Question Answering • Video Question Answering

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

VALUE-Leaderboard/StarterCode 8 Jun 2021

Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task.

Language understanding • Multi-Task Learning • +4