Video Question Answering
153 papers with code • 20 benchmarks • 32 datasets
Video Question Answering (VideoQA) is the task of answering natural language questions about a given video: the model must produce answers that are accurate with respect to the video's content.
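As a rough sketch of the typical inference pipeline (uniformly sample frames, then ask a video-language model the question), assuming OpenCV for decoding; `VideoQAModel` is a hypothetical placeholder, not an API from any paper below:

```python
# Minimal VideoQA inference skeleton (illustrative sketch only).
import cv2
import numpy as np

def sample_frames(path: str, num_frames: int = 8) -> list[np.ndarray]:
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

# Typical usage: encode the sampled frames and the question, then decode
# a free-form answer. `VideoQAModel` below is hypothetical.
# frames = sample_frames("demo.mp4")
# answer = VideoQAModel().answer(frames, "What is the person doing?")
```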
Libraries
Use these libraries to find Video Question Answering models and implementations.

Latest papers
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answering and captioning tasks.
Listen Then See: Video Alignment with Speaker Attention
Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge with the language modality.
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames, restricting them to short-video understanding.
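To make the memory-bank idea concrete, here is a minimal sketch of MA-LMM-style compression, assuming PyTorch: when the bank of per-frame features exceeds its capacity, the most similar adjacent pair is averaged into one slot (the paper operates on learned features; this simplification uses raw cosine similarity):

```python
import torch
import torch.nn.functional as F

def compress_memory(bank: torch.Tensor, capacity: int) -> torch.Tensor:
    """Keep a frame-feature bank of shape (T, D) under `capacity` by
    averaging the most similar adjacent pair (simplified sketch)."""
    while bank.size(0) > capacity:
        sims = F.cosine_similarity(bank[:-1], bank[1:], dim=-1)  # (T-1,)
        i = int(sims.argmax())
        merged = (bank[i] + bank[i + 1]) / 2
        bank = torch.cat([bank[:i], merged[None], bank[i + 2:]], dim=0)
    return bank

# Example: stream 32 frame features into a bank capped at 8 slots.
feats = torch.randn(32, 256)
bank = feats[:1]
for t in range(1, feats.size(0)):
    bank = compress_memory(torch.cat([bank, feats[t:t + 1]]), capacity=8)
```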
LongVLM: Efficient Long Video Understanding via Large Language Models
In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos.
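A minimal sketch of one way to combine local and global video information, assuming PyTorch (the segment count and mean pooling are illustrative choices, not the paper's exact design):

```python
import torch

def local_global_tokens(frame_feats: torch.Tensor,
                        num_segments: int = 4) -> torch.Tensor:
    """Pool per-segment (local) and whole-video (global) features into
    one token sequence for the LLM; simplified vs. the paper."""
    segments = frame_feats.chunk(num_segments, dim=0)
    local = torch.stack([s.mean(dim=0) for s in segments])  # (num_segments, D)
    global_tok = frame_feats.mean(dim=0, keepdim=True)      # (1, D)
    return torch.cat([local, global_tok], dim=0)

tokens = local_global_tokens(torch.randn(64, 512))  # (5, 512)
```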
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
Preference modeling techniques, such as direct preference optimization (DPO), have proven effective in enhancing the generalization abilities of large language models (LLMs).
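For reference, the standard DPO objective the paper builds on can be written in a few lines, assuming PyTorch and per-sequence summed log-probabilities from the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Standard DPO objective: push the policy to prefer the chosen
    response over the rejected one, relative to a frozen reference."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with a batch of two sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -11.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-14.0, -10.5]))
```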
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging.
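The paper's core trick of tiling sampled frames into a single composite image is easy to sketch, assuming Pillow; the 2x3 grid and 336-pixel cells below are arbitrary illustrative defaults:

```python
from PIL import Image

def frames_to_grid(frames: list[Image.Image], rows: int = 2, cols: int = 3,
                   cell: tuple[int, int] = (336, 336)) -> Image.Image:
    """Tile sampled frames into one image so an image-only VLM can
    'see' the whole clip in a single forward pass."""
    grid = Image.new("RGB", (cols * cell[0], rows * cell[1]))
    for i, frame in enumerate(frames[: rows * cols]):
        r, c = divmod(i, cols)
        grid.paste(frame.resize(cell), (c * cell[0], r * cell[1]))
    return grid

# Usage: pass frames_to_grid(frames) plus the question to any image VLM.
```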
OmniVid: A Generative Framework for Universal Video Understanding
The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution.
Elysium: Exploring Object-level Perception in Videos via MLLM
Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application in video-related tasks, such as object tracking, remains understudied.
vid-TLDR: Training Free Token merging for Light-weight Video Transformer
To tackle these issues, we propose training free token merging for lightweight video Transformer (vid-TLDR) that aims to enhance the efficiency of video Transformers by merging the background tokens without additional training.
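A minimal training-free merging sketch in PyTorch: keep the most salient tokens and collapse the rest into one averaged background token. vid-TLDR derives saliency from attention maps; feature norm is used below only as a stand-in proxy:

```python
import torch

def merge_background_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep the `keep` most salient of N tokens (shape (N, D)) and merge
    the remainder into a single averaged background token."""
    saliency = tokens.norm(dim=-1)                    # proxy score, (N,)
    idx = saliency.argsort(descending=True)
    fg, bg = idx[:keep], idx[keep:]
    merged_bg = tokens[bg].mean(dim=0, keepdim=True)  # one background token
    return torch.cat([tokens[fg], merged_bg], dim=0)  # (keep + 1, D)

reduced = merge_background_tokens(torch.randn(197, 768), keep=64)
```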
HawkEye: Training Video-Text LLMs for Grounding Text in Videos
Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos.