Video-based Generative Performance Benchmarking
16 papers with code • 1 benchmark • 1 dataset
The benchmark evaluates a generative Video Conversational Model and covers five key aspects:
- Correctness of Information
- Detailed Orientation
- Contextual Understanding
- Temporal Understanding
- Consistency
We curate a test set based on the ActivityNet-200 dataset, featuring videos with rich, dense descriptive captions and associated question-answer pairs from human annotations. We develop an evaluation pipeline using the GPT-3.5 model that assigns a relative score to the generated predictions on a scale of 1-5.
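The final aggregation step of such a pipeline could be sketched as follows. This is a minimal Python illustration, not the benchmark's actual code: the function name `aggregate_scores` and the `(aspect, score)` record format are assumptions made for the example, and the call to the judge model that produces each score is left out entirely.

```python
from collections import defaultdict
from statistics import mean

# The five aspects listed above.
ASPECTS = (
    "Correctness of Information",
    "Detailed Orientation",
    "Contextual Understanding",
    "Temporal Understanding",
    "Consistency",
)


def aggregate_scores(records):
    """Average judge scores per aspect.

    `records` is an iterable of (aspect, score) pairs, where each score is
    assumed to be a number already parsed from the judge model's reply.
    This input format is hypothetical.
    """
    by_aspect = defaultdict(list)
    for aspect, score in records:
        if aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {aspect!r}")
        # The benchmark description states a 1-5 scale; reject anything else.
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range 1-5: {score!r}")
        by_aspect[aspect].append(score)
    # Report only aspects that received at least one score, in listed order.
    return {a: mean(by_aspect[a]) for a in ASPECTS if a in by_aspect}


if __name__ == "__main__":
    sample = [
        ("Consistency", 4),
        ("Consistency", 2),
        ("Temporal Understanding", 3),
    ]
    for aspect, avg in aggregate_scores(sample).items():
        print(f"{aspect}: {avg}")
```

Rejecting out-of-range scores instead of clamping them is a design choice for this sketch: a judge reply that falls outside 1-5 most likely signals a parsing problem, which is better surfaced than silently averaged in.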
Most implemented papers
Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback
Recent advancements in large language models have influenced the development of video large multimodal models (VLMMs).
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
This paper focuses on the challenge of answering questions in scenarios that are composed of rich and complex dynamic audio-visual components.
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging.
LITA: Language Instructed Temporal-Localization Assistant
In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for learning and evaluating this task.
ST-LLM: Large Language Models Are Effective Temporal Learners
In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLMs?
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks.