Video Question Answering

130 papers with code • 18 benchmarks • 28 datasets

Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos. Given a video and a question in natural language, the model produces accurate answers according to the content of the video.


Use these libraries to find Video Question Answering models and implementations

Most implemented papers

Is Space-Time Attention All You Need for Video Understanding?

facebookresearch/TimeSformer 9 Feb 2021

We present a convolution-free approach to video classification built exclusively on self-attention over space and time.

TVQA: Localized, Compositional Video Question Answering

jayleicn/TVQA EMNLP 2018

Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks.

Flamingo: a Visual Language Model for Few-Shot Learning

mlfoundations/open_flamingo DeepMind 2022

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

jpthu17/emcl 21 Nov 2022

Most video-and-language representation learning approaches employ contrastive learning, e. g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs.

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

alibaba/AliceMind 1 Feb 2023

In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.

Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

jpthu17/HBI CVPR 2023

Contrastive learning-based video-language representation learning approaches, e. g., CLIP, have achieved outstanding performance, which pursue semantic interaction upon pre-defined video-text pairs.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

PKU-YuanGroup/Video-LLaVA 16 Nov 2023

In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM.

TVQA+: Spatio-Temporal Grounding for Video Question Answering

jayleicn/TVQAplus ACL 2020

We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

linjieli222/HERO EMNLP 2020

We present HERO, a novel framework for large-scale video+language omni-representation learning.