Progressive Video Summarization via Multimodal Self-supervised Learning

7 Jan 2022 · Li Haopeng, Ke Qiuhong, Gong Mingming, Tom Drummond

Modern video summarization methods are based on deep neural networks that require a large amount of annotated data for training. However, existing datasets for video summarization are small-scale, which makes deep models prone to over-fitting. Because annotating large-scale datasets is time-consuming, we propose a multimodal self-supervised learning framework to obtain semantic representations of videos, which benefits the video summarization task. Specifically, the self-supervised learning is conducted by exploring the semantic consistency between videos and text in both coarse-grained and fine-grained fashions, as well as by recovering masked frames in the videos. The multimodal framework is trained on a newly collected dataset that consists of video-text pairs. Additionally, we introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries. Extensive experiments demonstrate the effectiveness and superiority of our method in terms of rank correlation coefficients and F-score.
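
The abstract does not say which objective enforces the coarse-grained video-text consistency. As an illustration only, the sketch below uses a symmetric InfoNCE-style contrastive loss, a common choice for aligning paired embeddings; it is an assumption, not the loss from the paper. The function names, the temperature value, and the toy embeddings are all hypothetical, and every embedding is assumed to be non-zero.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length so that dot() gives cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def log_softmax_at(logits, index):
    """Log-probability of `index` under a softmax over `logits` (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[index] - log_sum


def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    video_embs[i] and text_embs[i] are assumed to describe the same clip.
    The temperature default is a hypothetical value, not taken from the paper.
    """
    videos = [normalize(v) for v in video_embs]
    texts = [normalize(t) for t in text_embs]
    n = len(videos)
    # sim[i][j]: scaled cosine similarity between video i and text j.
    sim = [[dot(v, t) / temperature for t in texts] for v in videos]
    # Video-to-text: each video should pick out its own text among all texts.
    v2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-video: each text should pick out its own video among all videos.
    t2v = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (v2t + t2v) / 2


# Toy batch: correctly paired embeddings versus the same texts in swapped order.
videos = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = contrastive_loss(videos, matched_texts)
loss_swapped = contrastive_loss(videos, swapped_texts)
```

In this toy batch the loss for the correctly paired texts is lower than for the swapped ones, which is the behavior a consistency objective of this kind is meant to reward.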

Datasets


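Task                            Dataset  Model         Metric                Value  Global Rank
Supervised Video Summarization  SumMe    SSPVS(+Text)  F1-score (Canonical)  50.7   # 10
Supervised Video Summarization  SumMe    SSPVS(+Text)  Kendall's Tau         0.192  # 4
Supervised Video Summarization  SumMe    SSPVS(+Text)  Spearman's Rho        0.257  # 2
Supervised Video Summarization  SumMe    SSPVS         F1-score (Canonical)  48.7   # 13
Supervised Video Summarization  SumMe    SSPVS         F1-score (Augmented)  50.4   # 6
Supervised Video Summarization  SumMe    SSPVS         Kendall's Tau         0.178  # 5
Supervised Video Summarization  SumMe    SSPVS         Spearman's Rho        0.240  # 3
Supervised Video Summarization  TvSum    SSPVS(+Text)  F1-score (Canonical)  60.4   # 15
Supervised Video Summarization  TvSum    SSPVS(+Text)  Kendall's Tau         0.181  # 4
Supervised Video Summarization  TvSum    SSPVS(+Text)  Spearman's Rho        0.238  # 3
Supervised Video Summarization  TvSum    SSPVS         F1-score (Canonical)  60.3   # 16
Supervised Video Summarization  TvSum    SSPVS         F1-score (Augmented)  61.8   # 6
Supervised Video Summarization  TvSum    SSPVS         Kendall's Tau         0.177  # 5
Supervised Video Summarization  TvSum    SSPVS         Spearman's Rho        0.233  # 4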
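
The Kendall's Tau and Spearman's Rho rows are rank correlations between predicted and human importance scores. The sketch below shows how such coefficients can be computed for a single video in plain Python, using the tau-b variant for Kendall's coefficient. The score lists are made-up illustrative values, and the benchmark's exact protocol (for example, how scores are averaged over annotators and videos) is not given on this page and is not reproduced here.

```python
from itertools import combinations
from math import sqrt


def kendall_tau(x, y):
    """Kendall's tau-b: agreement in pairwise ordering, corrected for ties."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: counted in neither tie total
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b) if var_a and var_b else 0.0


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-frame importance scores for one short video.
predicted = [0.10, 0.40, 0.35, 0.80, 0.60]
human = [0.20, 0.30, 0.50, 0.90, 0.40]
tau = kendall_tau(predicted, human)
rho = spearman_rho(predicted, human)
```

Both coefficients lie in [-1, 1] and depend only on how the frames are ordered, not on the raw score values, so they measure something different from the F1-score rows.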
