A Two-Stage Framework to Generate Video Chapter

29 Sep 2021  ·  Canyu Le, Zhiyuan Tang, Ke Li, Jiandong Yang ·

We aim to address the problem of video chapter generation. This task differs significantly from traditional video activity analysis: the videos are much longer and contain many complex temporal structures, and the association between video frames and narration plays a crucial role in conveying the underlying information. To facilitate research in this direction, we introduce a large-scale dataset called ChapterGen, which consists of approximately $10k$ user-generated videos with annotated chapter descriptions. Our data collection procedure is fast, scalable, and requires no additional manual annotation. On top of this dataset, we propose a two-stage framework that performs chapter localization and chapter title generation. The framework captures two aspects of a video: visual dynamics and narration text. To parse the whole video efficiently, we build the framework on a flexible clip sliding window. Our experiments demonstrate that the proposed framework outperforms existing methods in both accuracy and efficiency.
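
As a rough illustration only (not the authors' implementation), the sketch below shows how such a two-stage, sliding-window pipeline could be wired up: stage 1 fuses per-clip visual and narration features and scores each clip as a potential chapter boundary, and stage 2 would then generate a title for each resulting segment. All module names, feature dimensions, and the boundary threshold are illustrative assumptions.

```python
# Minimal sketch of a two-stage chapter pipeline (assumed design, not the paper's code).
import torch
import torch.nn as nn


class BoundaryLocalizer(nn.Module):
    """Stage 1: per-clip chapter-boundary probability from fused clip features."""

    def __init__(self, vis_dim=512, txt_dim=768, hidden=256):
        super().__init__()
        self.fuse = nn.Linear(vis_dim + txt_dim, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, T, vis_dim), txt_feats: (B, T, txt_dim), one row per clip window
        x = torch.relu(self.fuse(torch.cat([vis_feats, txt_feats], dim=-1)))
        x, _ = self.encoder(x)
        return torch.sigmoid(self.head(x)).squeeze(-1)  # (B, T) boundary scores


def clips_to_segments(boundary_scores, threshold=0.5):
    """Split the clip sequence into chapter segments at predicted boundaries."""
    bounds = [0] + [t for t, s in enumerate(boundary_scores) if t > 0 and s > threshold]
    bounds.append(len(boundary_scores))
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]


if __name__ == "__main__":
    # Random tensors stand in for clip-level visual and narration (ASR) embeddings.
    B, T = 1, 20
    vis, txt = torch.randn(B, T, 512), torch.randn(B, T, 768)
    scores = BoundaryLocalizer()(vis, txt)[0].tolist()
    segments = clips_to_segments(scores)
    # Stage 2 would feed each segment's clips and narration into a text decoder
    # (e.g. a transformer) to produce the chapter title.
    print(segments)
```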

Datasets

ChapterGen (introduced in this paper)