Collects dense per-video-shot concept annotations.
4 PAPERS • 1 BENCHMARK
VideoXum is an enriched large-scale dataset for cross-modal video summarization, built on ActivityNet Captions. The dataset includes three subtasks: Video-to-Video Summarization (V2V-SUM), Video-to-Text Summarization (V2T-SUM), and Video-to-Video&Text Summarization (V2VT-SUM).
3 PAPERS • NO BENCHMARKS YET
MultiSum is a dataset for multimodal summarization with multimodal output (MSMO). It spans 17 categories and 170 subcategories, encapsulating a diverse array of real-world scenarios.
1 PAPER • NO BENCHMARKS YET
A short video clip may contain the progression of multiple events and an interesting storyline. A human needs to capture the event in every shot and associate the shots together to understand the story behind them.
1 PAPER • 3 BENCHMARKS