Video Question Answering
153 papers with code • 20 benchmarks • 32 datasets
Video Question Answering (VideoQA) is the task of answering natural language questions about a given video: the model must produce answers that are accurate with respect to the video's content.
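As a rough sketch of the typical inference pipeline (uniformly sample frames, then ask a video-language model the question), assuming OpenCV for decoding; `VideoQAModel` is a hypothetical placeholder, not an API from any paper below:

```python
# Minimal VideoQA inference skeleton (illustrative sketch only).
import cv2
import numpy as np

def sample_frames(path: str, num_frames: int = 8) -> list[np.ndarray]:
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

# Typical usage: encode the sampled frames and the question, then decode
# a free-form answer. `VideoQAModel` below is hypothetical.
# frames = sample_frames("demo.mp4")
# answer = VideoQAModel().answer(frames, "What is the person doing?")
```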
Libraries
Use these libraries to find Video Question Answering models and implementations.

Latest papers
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answering and captioning tasks.
Listen Then See: Video Alignment with Speaker Attention
Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge with the language modality.
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames, restricting them to short-video understanding.
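To make the memory-bank idea concrete, here is a minimal sketch of MA-LMM-style compression, assuming PyTorch: when the bank of per-frame features exceeds its capacity, the most similar adjacent pair is averaged into one slot (the paper operates on learned features; this simplification uses raw cosine similarity):

```python
import torch
import torch.nn.functional as F

def compress_memory(bank: torch.Tensor, capacity: int) -> torch.Tensor:
    """Keep a frame-feature bank of shape (T, D) under `capacity` by
    averaging the most similar adjacent pair (simplified sketch)."""
    while bank.size(0) > capacity:
        sims = F.cosine_similarity(bank[:-1], bank[1:], dim=-1)  # (T-1,)
        i = int(sims.argmax())
        merged = (bank[i] + bank[i + 1]) / 2
        bank = torch.cat([bank[:i], merged[None], bank[i + 2:]], dim=0)
    return bank

# Example: stream 32 frame features into a bank capped at 8 slots.
feats = torch.randn(32, 256)
bank = feats[:1]
for t in range(1, feats.size(0)):
    bank = compress_memory(torch.cat([bank, feats[t:t + 1]]), capacity=8)
```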
LongVLM: Efficient Long Video Understanding via Large Language Models
In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos.
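A minimal sketch of one way to combine local and global video information, assuming PyTorch (the segment count and mean pooling are illustrative choices, not the paper's exact design):

```python
import torch

def local_global_tokens(frame_feats: torch.Tensor,
                        num_segments: int = 4) -> torch.Tensor:
    """Pool per-segment (local) and whole-video (global) features into
    one token sequence for the LLM; simplified vs. the paper."""
    segments = frame_feats.chunk(num_segments, dim=0)
    local = torch.stack([s.mean(dim=0) for s in segments])  # (num_segments, D)
    global_tok = frame_feats.mean(dim=0, keepdim=True)      # (1, D)
    return torch.cat([local, global_tok], dim=0)

tokens = local_global_tokens(torch.randn(64, 512))  # (5, 512)
```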
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
Preference modeling techniques, such as direct preference optimization (DPO), have proven effective in enhancing the generalization abilities of large language models (LLMs).
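For reference, the standard DPO objective the paper builds on can be written in a few lines, assuming PyTorch and per-sequence summed log-probabilities from the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Standard DPO objective: push the policy to prefer the chosen
    response over the rejected one, relative to a frozen reference."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with a batch of two sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -11.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-14.0, -10.5]))
```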
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging.
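The paper's core trick of tiling sampled frames into a single composite image is easy to sketch, assuming Pillow; the 2x3 grid and 336-pixel cells below are arbitrary illustrative defaults:

```python
from PIL import Image

def frames_to_grid(frames: list[Image.Image], rows: int = 2, cols: int = 3,
                   cell: tuple[int, int] = (336, 336)) -> Image.Image:
    """Tile sampled frames into one image so an image-only VLM can
    'see' the whole clip in a single forward pass."""
    grid = Image.new("RGB", (cols * cell[0], rows * cell[1]))
    for i, frame in enumerate(frames[: rows * cols]):
        r, c = divmod(i, cols)
        grid.paste(frame.resize(cell), (c * cell[0], r * cell[1]))
    return grid

# Usage: pass frames_to_grid(frames) plus the question to any image VLM.
```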
OmniVid: A Generative Framework for Universal Video Understanding
The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution.
Elysium: Exploring Object-level Perception in Videos via MLLM
Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application in video-related tasks, such as object tracking, remains understudied.
vid-TLDR: Training Free Token merging for Light-weight Video Transformer
To tackle these issues, we propose training free token merging for lightweight video Transformer (vid-TLDR) that aims to enhance the efficiency of video Transformers by merging the background tokens without additional training.
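A minimal training-free merging sketch in PyTorch: keep the most salient tokens and collapse the rest into one averaged background token. vid-TLDR derives saliency from attention maps; feature norm is used below only as a stand-in proxy:

```python
import torch

def merge_background_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep the `keep` most salient of N tokens (shape (N, D)) and merge
    the remainder into a single averaged background token."""
    saliency = tokens.norm(dim=-1)                    # proxy score, (N,)
    idx = saliency.argsort(descending=True)
    fg, bg = idx[:keep], idx[keep:]
    merged_bg = tokens[bg].mean(dim=0, keepdim=True)  # one background token
    return torch.cat([tokens[fg], merged_bg], dim=0)  # (keep + 1, D)

reduced = merge_background_tokens(torch.randn(197, 768), keep=64)
```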
HawkEye: Training Video-Text LLMs for Grounding Text in Videos
Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos.