Video Question Answering
130 papers with code • 18 benchmarks • 28 datasets
Video Question Answering (VideoQA) aims to answer natural language questions about a given video: given a video and a question in natural language, the model must produce an accurate answer based on the video's content.
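The sketch below illustrates this setup in the simplest possible form (all module names and dimensions are hypothetical, not taken from any specific paper): per-frame video features and a pooled question embedding are fused by a transformer layer, and the answer is predicted as a classification over a fixed answer vocabulary, a common formulation for open-ended VideoQA.

```python
# Minimal, illustrative VideoQA model: fuse frame features with a question
# embedding and classify over a fixed answer vocabulary.
import torch
import torch.nn as nn

class ToyVideoQA(nn.Module):
    def __init__(self, frame_dim=512, text_dim=512, hidden=512, num_answers=1000):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden)   # per-frame visual features
        self.text_proj = nn.Linear(text_dim, hidden)     # pooled question embedding
        self.fusion = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.classifier = nn.Linear(hidden, num_answers)  # answer vocabulary logits

    def forward(self, frame_feats, question_feat):
        # frame_feats: (B, T, frame_dim), question_feat: (B, text_dim)
        tokens = torch.cat(
            [self.text_proj(question_feat).unsqueeze(1), self.frame_proj(frame_feats)], dim=1
        )
        fused = self.fusion(tokens)          # joint video-question reasoning
        return self.classifier(fused[:, 0])  # predict the answer from the question token

model = ToyVideoQA()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 512))  # (2, 1000)
```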
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs.
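A minimal sketch of this CLIP-style contrastive objective (dimensions and the temperature value are illustrative assumptions): pooled video and text features are normalized into a shared space, and a symmetric InfoNCE loss pulls matched video-text pairs together while pushing mismatched pairs apart.

```python
# CLIP-style symmetric InfoNCE loss over a batch of matched video-text pairs.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (B, D) pooled features for B matched video-text pairs
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                   # (B, B) similarity matrix
    targets = torch.arange(len(v), device=v.device)  # diagonal entries are true pairs
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text matching
    loss_t2v = F.cross_entropy(logits.T, targets)    # text -> video matching
    return (loss_v2t + loss_t2v) / 2

loss = video_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```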
In contrast to the predominant paradigms that rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network that shares common universal modules for modality collaboration while disentangling modality-specific modules to deal with modality entanglement.
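A rough sketch of this modality-collaboration idea (a toy approximation, not mPLUG-2's actual architecture): each modality keeps its own encoder module, while a shared "universal" module processes both streams, so common computation is shared and modality-specific processing stays disentangled.

```python
# Toy modular backbone: modality-specific modules plus one shared universal module.
import torch
import torch.nn as nn

class ModularBackbone(nn.Module):
    def __init__(self, dim=512, nhead=8):
        super().__init__()
        self.video_module = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.text_module = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.universal_module = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)

    def forward(self, video_tokens, text_tokens):
        # The universal module is reused across both modalities.
        video = self.universal_module(self.video_module(video_tokens))
        text = self.universal_module(self.text_module(text_tokens))
        return video, text

backbone = ModularBackbone()
v, t = backbone(torch.randn(2, 16, 512), torch.randn(2, 20, 512))
```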
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Contrastive learning-based video-language representation learning approaches, e.g., CLIP, which pursue semantic interaction upon pre-defined video-text pairs, have achieved outstanding performance.
In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM.
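A schematic sketch of unifying visual representations into the language feature space, in the spirit of the sentence above (names and dimensions are assumptions, not the actual implementation): a small projector maps video-frame features to the LLM's token-embedding dimension so they can be concatenated with text embeddings and fed to the LLM as one sequence.

```python
# Illustrative projector that maps visual features into an LLM's embedding space.
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, frame_feats):
        # frame_feats: (B, T, vision_dim) -> (B, T, llm_dim) "visual tokens"
        return self.proj(frame_feats)

projector = VisualProjector()
visual_tokens = projector(torch.randn(1, 8, 1024))
text_tokens = torch.randn(1, 32, 4096)                      # stand-in for text embeddings
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)  # single sequence for the LLM
```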
We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.