Self-Supervised Learning of Compressed Video Representations
Self-supervised learning of video representations has recently received great attention. Existing methods typically require frames to be decoded before being processed, which increases compute and storage requirements and ultimately hinders large-scale training. In this work, we propose a self-supervised approach to learn compressed video representations, eliminating the expensive decoding step. We use a three-stream video architecture that directly encodes I-frames and P-frames of a compressed video. Unlike existing approaches that encode I-frames and P-frames individually, we propose to jointly encode them by establishing bidirectional dynamic connections across streams. We show that our approach outperforms existing compressed video approaches in the supervised regime while maintaining computational efficiency. To enable self-supervised learning, we propose two pretext tasks that leverage the multimodal nature (RGB, motion vector, residuals) and the internal GOP structure of compressed videos. The first task asks our network to predict zeroth-order motion statistics in a spatio-temporal pyramid; the second task asks correspondence types between I-frames and P-frames after applying temporal transformation. We show that our approach achieves competitive performance on self-supervised learning of video representations with a considerable improvement in speed compared to the traditional methods.
PDF Abstract