Video Question Answering

153 papers with code • 20 benchmarks • 32 datasets

Video Question Answering (VideoQA) is the task of answering natural language questions about videos: given a video and a question, a model must produce an accurate answer grounded in the video's content.
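
As a rough illustration of the task interface, the sketch below uniformly samples frames from a clip with OpenCV and hands them, together with the question, to a placeholder video-language model. `VideoQAModel` and its `answer` method are hypothetical stand-ins, not the API of any system listed below.

```python
# Minimal sketch of the VideoQA interface.
# `VideoQAModel` and its `answer` method are hypothetical placeholders;
# each system listed below defines its own model and API.
import cv2


def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


# Hypothetical usage:
# model = VideoQAModel.load("some-checkpoint")   # placeholder, not a real API
# print(model.answer(sample_frames("clip.mp4"), "What is the person holding?"))
```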

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

magic-research/PLLaVA 26 Apr 2024

PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question answering and captioning tasks.


Listen Then See: Video Alignment with Speaker Attention

sts-vlcc/sts-vlcc 21 Apr 2024

Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge with the language modality.


MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

boheumd/MA-LMM 8 Apr 2024

However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding.

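The memory-bank idea can be illustrated with a toy routine that keeps a fixed-size buffer of frame features and, once the buffer is full, averages the two most similar adjacent entries into one slot; this is a simplified sketch of the general technique, not the MA-LMM implementation.

```python
# Toy sketch of a fixed-size frame-feature memory bank (not the MA-LMM code):
# when the bank exceeds its capacity, the two most similar adjacent features
# are averaged into one slot, so long videos are compressed online.
import torch
import torch.nn.functional as F


def update_memory(bank: list[torch.Tensor], feat: torch.Tensor, capacity: int = 16):
    """Append a new frame feature and compress the bank if it grows too large."""
    bank.append(feat)
    if len(bank) <= capacity:
        return bank
    # Cosine similarity between each pair of adjacent entries.
    sims = [F.cosine_similarity(bank[i], bank[i + 1], dim=0) for i in range(len(bank) - 1)]
    i = int(torch.stack(sims).argmax())
    merged = (bank[i] + bank[i + 1]) / 2          # average the most redundant pair
    return bank[:i] + [merged] + bank[i + 2:]


# Usage: stream per-frame features (e.g., from a frozen visual encoder) into the bank.
bank: list[torch.Tensor] = []
for _ in range(100):                              # 100 frames, bank stays at 16 entries
    bank = update_memory(bank, torch.randn(256))
print(len(bank))                                  # -> 16
```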

LongVLM: Efficient Long Video Understanding via Large Language Models

ziplab/longvlm 4 Apr 2024

In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos.

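The local-plus-global encoding described above can be sketched by pooling patch tokens within short temporal segments (local detail) and over the whole video (global context), then concatenating the results before projection into the LLM. Shapes and pooling choices below are illustrative assumptions, not the LongVLM code.

```python
# Illustrative sketch of combining local (per-segment) and global (whole-video)
# visual tokens before feeding them to an LLM; shapes and pooling choices are
# assumptions for illustration, not the LongVLM implementation.
import torch

frame_tokens = torch.randn(64, 196, 768)          # 64 frames x 196 patch tokens x dim

# Local tokens: average patch tokens within each of 8 temporal segments.
segments = frame_tokens.view(8, 8, 196, 768)      # 8 segments of 8 frames each
local_tokens = segments.mean(dim=(1, 2))          # (8, 768), one token per segment

# Global token: pool over the whole video.
global_token = frame_tokens.mean(dim=(0, 1)).unsqueeze(0)                 # (1, 768)

# Concatenate and project into the LLM's embedding space (projection omitted).
video_tokens = torch.cat([local_tokens, global_token], dim=0)             # (9, 768)
print(video_tokens.shape)
```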

Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

riflezhang/llava-hound-dpo 1 Apr 2024

Preference modeling techniques, such as direct preference optimization (DPO), have proven effective in enhancing the generalization abilities of large language models (LLMs).

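For reference, the DPO objective scores the policy's log-probability margin between a preferred and a rejected response against a frozen reference model. The snippet below is a generic sketch of that loss, not the llava-hound-dpo training code.

```python
# Generic sketch of the DPO loss (Rafailov et al.), not the llava-hound-dpo code.
# Inputs are summed log-probabilities of the chosen / rejected responses under
# the trained policy and a frozen reference model.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()


# Example with a dummy batch of log-probabilities.
loss = dpo_loss(torch.tensor([-4.0]), torch.tensor([-6.0]),
                torch.tensor([-5.0]), torch.tensor([-5.5]))
print(loss.item())
```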

An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM

imagegridworth/IG-VLM 27 Mar 2024

Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging.

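The image-grid idea in the title can be illustrated simply: sample a handful of frames and tile them into one composite image that an off-the-shelf image VLM can consume alongside the question. The tiling below uses Pillow and is an illustrative sketch, not the IG-VLM implementation.

```python
# Illustrative sketch of tiling sampled video frames into a single grid image
# that an image-only VLM can take as input (not the IG-VLM implementation).
from PIL import Image


def make_image_grid(frames: list[Image.Image], rows: int = 2, cols: int = 3,
                    cell: tuple[int, int] = (336, 336)) -> Image.Image:
    """Resize frames to a fixed cell size and paste them into a rows x cols grid."""
    grid = Image.new("RGB", (cols * cell[0], rows * cell[1]))
    for i, frame in enumerate(frames[: rows * cols]):
        r, c = divmod(i, cols)
        grid.paste(frame.resize(cell), (c * cell[0], r * cell[1]))
    return grid


# Usage: pass make_image_grid(sampled_frames) plus the question to an image VLM.
```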

OmniVid: A Generative Framework for Universal Video Understanding

wangjk666/omnivid 26 Mar 2024

The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution.


Elysium: Exploring Object-level Perception in Videos via MLLM

hon-wong/elysium 25 Mar 2024

Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application in video-related tasks, such as object tracking, remains understudied.


vid-TLDR: Training Free Token merging for Light-weight Video Transformer

mlvlab/vid-tldr 20 Mar 2024

To tackle these issues, we propose training-free token merging for lightweight video Transformers (vid-TLDR), which aims to enhance the efficiency of video Transformers by merging background tokens without additional training.

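A toy version of background-token merging: score each token's saliency (here simply by feature norm, a stand-in for the paper's attention-based score), keep the most salient tokens, and collapse the rest into a single merged token. This is a simplified illustration of the idea, not the vid-TLDR algorithm.

```python
# Toy illustration of training-free token merging: keep the most salient tokens
# and average the rest into one "background" token. Saliency here is just the
# feature norm, a stand-in for the attention-based score used in the paper.
import torch


def merge_background_tokens(tokens: torch.Tensor, keep: int = 64) -> torch.Tensor:
    """tokens: (N, D) -> (keep + 1, D), with low-saliency tokens merged."""
    saliency = tokens.norm(dim=-1)                       # (N,)
    keep_idx = saliency.topk(keep).indices
    mask = torch.ones(tokens.size(0), dtype=torch.bool)
    mask[keep_idx] = False
    background = tokens[mask].mean(dim=0, keepdim=True)  # single merged token
    return torch.cat([tokens[keep_idx], background], dim=0)


tokens = torch.randn(1568, 768)                          # e.g., 8 frames x 196 patches
print(merge_background_tokens(tokens).shape)             # -> torch.Size([65, 768])
```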

HawkEye: Training Video-Text LLMs for Grounding Text in Videos

yellow-binary-tree/hawkeye 15 Mar 2024

Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos.
