Video-based Generative Performance Benchmarking

16 papers with code • 1 benchmark • 1 dataset

The benchmark evaluates generative video conversational models across five key aspects:

  • Correctness of Information
  • Detail Orientation
  • Contextual Understanding
  • Temporal Understanding
  • Consistency

We curate a test set based on the ActivityNet-200 dataset, featuring videos with rich, dense descriptive captions and human-annotated question-answer pairs. We develop an evaluation pipeline that uses GPT-3.5 to assign each generated prediction a relative score on a scale of 1-5.
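
As a rough illustration, a single stage of such a pipeline can look like the sketch below, which assumes the OpenAI chat completions client; the prompt wording and the `score_prediction` helper are illustrative assumptions, not the benchmark's official implementation.

```python
# Minimal sketch of a GPT-based evaluation call for one aspect
# (illustrative prompt; the benchmark's official prompts may differ).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_prediction(question: str, answer: str, prediction: str, aspect: str) -> int:
    """Ask GPT-3.5 to rate a model prediction on one aspect, 1-5."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": f"You evaluate video QA predictions for {aspect}. "
                        "Reply with a single integer score from 1 to 5."},
            {"role": "user",
             "content": f"Question: {question}\n"
                        f"Correct answer: {answer}\n"
                        f"Predicted answer: {prediction}"},
        ],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```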

Most implemented papers

Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback

yonseivnl/vlm-rlaif 6 Feb 2024

Recent advancements in large language models have influenced the development of video large multimodal models (VLMMs).

CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

rikeilong/bay-cat 7 Mar 2024

This paper focuses on the challenge of answering questions in scenarios that are composed of rich and complex dynamic audio-visual components.

An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM

imagegridworth/IG-VLM 27 Mar 2024

Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging.
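
The paper's titular idea is to arrange uniformly sampled video frames into one composite grid image that an off-the-shelf VLM can process in a single pass. Below is a minimal sketch of that grid construction using OpenCV; the 2x3 grid and 336-pixel cell size are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of composing sampled video frames into a single grid
# image, in the spirit of IG-VLM (grid layout and sampling here are
# illustrative, not the paper's exact setup).
import cv2
import numpy as np

def frames_to_grid(video_path: str, rows: int = 2, cols: int = 3,
                   cell: tuple = (336, 336)) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Sample rows*cols frames uniformly across the video.
    indices = np.linspace(0, total - 1, rows * cols).astype(int)
    tiles = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            frame = np.zeros((*cell, 3), dtype=np.uint8)
        tiles.append(cv2.resize(frame, cell))
    cap.release()
    # Tile frames row by row into one image the VLM sees at once.
    grid_rows = [np.hstack(tiles[r * cols:(r + 1) * cols]) for r in range(rows)]
    return np.vstack(grid_rows)
```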

LITA: Language Instructed Temporal-Localization Assistant

nvlabs/lita 27 Mar 2024

In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for learning and evaluating this task.

ST-LLM: Large Language Models Are Effective Temporal Learners

TencentARC/ST-LLM 30 Mar 2024

In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLM?
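
As a rough picture of what that question amounts to, the sketch below projects per-frame visual tokens into the LLM embedding space and flattens them into one long sequence; the tensor shapes and module names are illustrative assumptions, not ST-LLM's actual code.

```python
# Minimal sketch: project every spatial-temporal token and hand the
# full sequence to the LLM (shapes here are illustrative assumptions).
import torch
import torch.nn as nn

T, P, D_VIS, D_LLM = 16, 256, 1024, 4096  # frames, patches/frame, dims

visual_tokens = torch.randn(1, T, P, D_VIS)  # from a frozen image encoder
projector = nn.Linear(D_VIS, D_LLM)          # vision-to-LLM projection

# Flatten time and space into one token sequence: (1, T*P, D_LLM).
# Video sequence modeling is then left entirely to the LLM's attention.
llm_inputs = projector(visual_tokens).flatten(1, 2)
print(llm_inputs.shape)  # torch.Size([1, 4096, 4096])
```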

PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

magic-research/PLLaVA arXiv 2024

PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answering and captioning tasks.
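
The "parameter-free" part of the title refers to adapting an image model to video without training new modules. One way to read that is pooling over per-frame features before they reach the LLM, as in the hedged sketch below; the pooled output size is an illustrative assumption rather than the paper's exact setting.

```python
# Minimal sketch of a parameter-free pooling step in the spirit of
# PLLaVA's image-to-video extension (output size is an assumption).
import torch
import torch.nn.functional as F

T, H, W, D = 16, 24, 24, 1024             # frames, spatial grid, feature dim
frame_feats = torch.randn(1, T, H, W, D)  # per-frame LLaVA-style features

# Adaptive average pooling over (time, height, width) adds no new
# parameters, so the image model extends to video "for free".
x = frame_feats.permute(0, 4, 1, 2, 3)            # (1, D, T, H, W)
pooled = F.adaptive_avg_pool3d(x, (16, 12, 12))   # downsample spatially, keep time
video_tokens = pooled.flatten(2).transpose(1, 2)  # (1, 16*12*12, D) tokens for the LLM
print(video_tokens.shape)  # torch.Size([1, 2304, 1024])
```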