VideoInstruct (Video Instruction Dataset)

Introduced by Maaz et al. in Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

The Video Instruction Dataset is used to train Video-ChatGPT. It consists of 100,000 video instruction pairs, produced with a combination of human-assisted and semi-automatic annotation techniques that aim for high-quality video instruction data. These methods create question-answer pairs related to:

Video summarization
Description-based question-answers (exploring spatial, temporal, relationships, and reasoning concepts)
Creative/generative question-answers
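
To make the three categories above concrete, here is a minimal sketch of how one instruction pair could be represented and tallied by category. The dataset's actual file schema is not described here, so the class names, field names (`video_id`, `question`, `answer`, `qa_type`), and sample records are all hypothetical and only illustrate the idea.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class QAType(Enum):
    """The three question-answer categories listed above."""
    SUMMARIZATION = "video summarization"
    DESCRIPTION = "description-based"
    CREATIVE = "creative/generative"


@dataclass(frozen=True)
class InstructionPair:
    # All field names are assumptions, not the dataset's real schema.
    video_id: str
    question: str
    answer: str
    qa_type: QAType


def count_by_type(pairs):
    """Return how many pairs fall into each question-answer category."""
    return Counter(pair.qa_type for pair in pairs)


# Invented placeholder records, not taken from the dataset.
examples = [
    InstructionPair("example_0001", "Summarize the video.",
                    "placeholder summary", QAType.SUMMARIZATION),
    InstructionPair("example_0001", "What happens after the first scene?",
                    "placeholder description", QAType.DESCRIPTION),
    InstructionPair("example_0002", "Write a short poem about the video.",
                    "placeholder poem", QAType.CREATIVE),
]

counts = count_by_type(examples)
```

Keeping the category as an explicit field would make it straightforward to check how a training subset is balanced across the three question types, assuming that information is available in the released files.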

Benchmarks

Task	Dataset Variant	Best Model
Video-based Generative Performance Benchmarking	VideoInstruct	VLM-RLAIF
Video-based Generative Performance Benchmarking (Correctness of Information)	VideoInstruct	PLLaVA
Video-based Generative Performance Benchmarking (Consistency)	VideoInstruct	PLLaVA
Video-based Generative Performance Benchmarking (Contextual Understanding)	VideoInstruct	PLLaVA
Video-based Generative Performance Benchmarking (Detail Orientation)	VideoInstruct	PLLaVA
Video-based Generative Performance Benchmarking (Temporal Understanding)	VideoInstruct	ST-LLM