Is Space-Time Attention All You Need for Video Understanding?

open-mmlab/mmaction2 9 Feb 2021

We present a convolution-free approach to video classification built exclusively on self-attention over space and time.

OmniNet: A unified architecture for multi-modal multi-task learning

subho406/OmniNet 17 Jul 2019

We also show that using this neural network pre-trained on some modalities assists in learning unseen tasks such as video captioning and video question answering.

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

jayleicn/ClipBERT CVPR 2021

Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.

Ranked #2 on Visual Question Answering on MSRVTT-QA (using extra training data)
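
The sparse-sampling idea in the excerpt above can be illustrated with a minimal sketch. This is not ClipBERT's actual code: the function name `sample_sparse_clips`, its parameters, and the one-clip-per-segment strategy are all assumptions made for illustration. It picks a handful of short clips spread across a video instead of processing every frame.

```python
import random


def sample_sparse_clips(num_frames, num_clips, clip_len, seed=None):
    """Return num_clips (start, end) frame ranges spread across a video.

    Hypothetical helper for illustration only; end is exclusive.
    """
    if num_clips <= 0 or clip_len <= 0:
        raise ValueError("num_clips and clip_len must be positive")
    if num_frames < clip_len:
        raise ValueError("video is shorter than one clip")

    rng = random.Random(seed)
    last_start = num_frames - clip_len  # latest frame a clip may start at
    clips = []
    for i in range(num_clips):
        # Divide the valid start positions into num_clips equal segments
        # and draw one random start from each, so clips cover the video.
        lo = (last_start * i) // num_clips
        hi = (last_start * (i + 1)) // num_clips
        start = rng.randint(lo, hi)  # randint is inclusive on both ends
        clips.append((start, start + clip_len))
    return clips
```

For example, `sample_sparse_clips(300, 4, 16, seed=0)` yields four 16-frame ranges, so only 64 of the 300 frames are touched; clips drawn from neighbouring segments may overlap slightly at the segment boundary.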

A Joint Sequence Fusion Model for Video Question Answering and Retrieval

antoine77340/howto100m ECCV 2018

We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pairs of multimodal sequence data (e.g., a video clip and a language sentence).

TVQA: Localized, Compositional Video Question Answering

jayleicn/TVQA EMNLP 2018

Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks.

Hierarchical Conditional Relation Networks for Video Question Answering

thaolmk54/hcrn-videoqa CVPR 2020

Video question answering (VideoQA) is challenging as it requires modeling capacity to distill dynamic visual artifacts and distant relations and to associate them with linguistic concepts.

TVQA+: Spatio-Temporal Grounding for Video Question Answering

jayleicn/TVQA-PLUS ACL 2020

We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

VALUE-Leaderboard/StarterCode 8 Jun 2021

Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task.

