Video-based Generative Performance Benchmarking

16 papers with code • 1 benchmark • 1 dataset

The benchmark evaluates a generative Video Conversational Model and covers five key aspects:

Correctness of Information
Detailed Orientation
Contextual Understanding
Temporal Understanding
Consistency

We curate a test set based on the ActivityNet-200 dataset, featuring videos with rich, dense descriptive captions and associated question-answer pairs from human annotations. We develop an evaluation pipeline using the GPT-3.5 model that assigns a relative score to the generated predictions on a scale of 1-5.
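The scoring step described above can be illustrated with a minimal sketch. It assumes the judge model has already returned one integer score from 1 to 5 for each prediction, tagged with one of the five aspects. The function name `aggregate_scores` and the sample judgements are hypothetical, and per-aspect averaging is an assumption about how such scores might be summarized, not the benchmark's documented procedure. The judge prompt is not shown because it is not given here.

```python
from collections import defaultdict
from statistics import mean

# The five aspects listed by the benchmark.
ASPECTS = (
    "Correctness of Information",
    "Detailed Orientation",
    "Contextual Understanding",
    "Temporal Understanding",
    "Consistency",
)


def aggregate_scores(judgements):
    """Average judge scores per aspect (hypothetical helper).

    judgements: iterable of (aspect, score) pairs, where score is an
    integer from 1 to 5 as assigned by the judge model.
    Returns a dict mapping each aspect that appears to its mean score.
    """
    by_aspect = defaultdict(list)
    for aspect, score in judgements:
        if aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {aspect!r}")
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range 1-5: {score!r}")
        by_aspect[aspect].append(score)
    return {aspect: mean(scores) for aspect, scores in by_aspect.items()}


# Made-up judgements, for illustration only.
sample = [
    ("Consistency", 4),
    ("Consistency", 2),
    ("Temporal Understanding", 3),
]
summary = aggregate_scores(sample)
```

Rejecting out-of-range scores and unknown aspect names early keeps a malformed judge response from silently skewing an average.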

Benchmarks

These leaderboards are used to track progress in Video-based Generative Performance Benchmarking.

Dataset	Best Model
VideoInstruct	VLM-RLAIF

Datasets

VideoInstruct

Subtasks

Video-based Generative Performance Benchmarking (Temporal Understanding)

Video-based Generative Performance Benchmarking (Consistency)

Most implemented papers

Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback

yonseivnl/vlm-rlaif • 6 Feb 2024

Recent advancements in large language models have influenced the development of video large multimodal models (VLMMs).

CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

rikeilong/bay-cat • 7 Mar 2024

This paper focuses on the challenge of answering questions in scenarios that are composed of rich and complex dynamic audio-visual components.

An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM

imagegridworth/IG-VLM • 27 Mar 2024

Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging.

LITA: Language Instructed Temporal-Localization Assistant

nvlabs/lita • 27 Mar 2024

In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for learning and evaluating this task.

ST-LLM: Large Language Models Are Effective Temporal Learners

TencentARC/ST-LLM • 30 Mar 2024

In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLMs?

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

magic-research/PLLaVA • arXiv 2024

PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks.
