Video Question Answering
130 papers with code • 18 benchmarks • 28 datasets
Video Question Answering (VideoQA) aims to answer natural language questions about a given video: given a video and a question in natural language, the model must produce an accurate answer based on the video's content.
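The sketch below illustrates this setup in the simplest possible form (all module names and dimensions are hypothetical, not taken from any specific paper): per-frame video features and a pooled question embedding are fused by a transformer layer, and the answer is predicted as a classification over a fixed answer vocabulary, a common formulation for open-ended VideoQA.

```python
# Minimal, illustrative VideoQA model: fuse frame features with a question
# embedding and classify over a fixed answer vocabulary.
import torch
import torch.nn as nn

class ToyVideoQA(nn.Module):
    def __init__(self, frame_dim=512, text_dim=512, hidden=512, num_answers=1000):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden)   # per-frame visual features
        self.text_proj = nn.Linear(text_dim, hidden)     # pooled question embedding
        self.fusion = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.classifier = nn.Linear(hidden, num_answers)  # answer vocabulary logits

    def forward(self, frame_feats, question_feat):
        # frame_feats: (B, T, frame_dim), question_feat: (B, text_dim)
        tokens = torch.cat(
            [self.text_proj(question_feat).unsqueeze(1), self.frame_proj(frame_feats)], dim=1
        )
        fused = self.fusion(tokens)          # joint video-question reasoning
        return self.classifier(fused[:, 0])  # predict the answer from the question token

model = ToyVideoQA()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 512))  # (2, 1000)
```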
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs.
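A minimal sketch of this CLIP-style contrastive objective (dimensions and the temperature value are illustrative assumptions): pooled video and text features are normalized into a shared space, and a symmetric InfoNCE loss pulls matched video-text pairs together while pushing mismatched pairs apart.

```python
# CLIP-style symmetric InfoNCE loss over a batch of matched video-text pairs.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (B, D) pooled features for B matched video-text pairs
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                   # (B, B) similarity matrix
    targets = torch.arange(len(v), device=v.device)  # diagonal entries are true pairs
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text matching
    loss_t2v = F.cross_entropy(logits.T, targets)    # text -> video matching
    return (loss_v2t + loss_t2v) / 2

loss = video_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```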
In contrast to the predominant paradigms that rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network that shares common universal modules for modality collaboration while disentangling modality-specific modules to deal with modality entanglement.
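A rough sketch of this modality-collaboration idea (a toy approximation, not mPLUG-2's actual architecture): each modality keeps its own encoder module, while a shared "universal" module processes both streams, so common computation is shared and modality-specific processing stays disentangled.

```python
# Toy modular backbone: modality-specific modules plus one shared universal module.
import torch
import torch.nn as nn

class ModularBackbone(nn.Module):
    def __init__(self, dim=512, nhead=8):
        super().__init__()
        self.video_module = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.text_module = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.universal_module = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)

    def forward(self, video_tokens, text_tokens):
        # The universal module is reused across both modalities.
        video = self.universal_module(self.video_module(video_tokens))
        text = self.universal_module(self.text_module(text_tokens))
        return video, text

backbone = ModularBackbone()
v, t = backbone(torch.randn(2, 16, 512), torch.randn(2, 20, 512))
```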
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Contrastive learning-based video-language representation learning approaches, e.g., CLIP, which pursue semantic interaction upon pre-defined video-text pairs, have achieved outstanding performance.
In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM.
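A schematic sketch of unifying visual representations into the language feature space, in the spirit of the sentence above (names and dimensions are assumptions, not the actual implementation): a small projector maps video-frame features to the LLM's token-embedding dimension so they can be concatenated with text embeddings and fed to the LLM as one sequence.

```python
# Illustrative projector that maps visual features into an LLM's embedding space.
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, frame_feats):
        # frame_feats: (B, T, vision_dim) -> (B, T, llm_dim) "visual tokens"
        return self.proj(frame_feats)

projector = VisualProjector()
visual_tokens = projector(torch.randn(1, 8, 1024))
text_tokens = torch.randn(1, 32, 4096)                      # stand-in for text embeddings
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)  # single sequence for the LLM
```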
We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.