Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting

18 Jun 2021 · Martine Toering, Ioannis Gatopoulos, Maarten Stol, Vincent Tao Hu

Instance-level contrastive learning techniques, which rely on data augmentation and a contrastive loss function, have found great success in the domain of visual representation learning. However, they are not well suited to exploiting the rich dynamical structure of video, as operations are performed on many augmented instances. In this paper we propose "Video Cross-Stream Prototypical Contrasting" (ViCC), a novel method which predicts consistent prototype assignments from both RGB and optical flow views, operating on sets of samples. Specifically, we alternate the optimization process: while optimizing one of the streams, all views are mapped to one set of stream prototype vectors. Each of the assignments is predicted with all views except the one matching the prediction, pushing representations closer to their assigned prototypes. As a result, more efficient video embeddings with ingrained motion information are learned, without the explicit need for optical flow computation during inference. We obtain state-of-the-art results on nearest-neighbour video retrieval and action recognition, outperforming the previous best by +3.2% on UCF101 using the S3D backbone (90.5% Top-1 acc), and by +7.2% on UCF101 and +15.1% on HMDB51 using the R(2+1)D backbone.
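The prediction step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the temperature, and the use of Sinkhorn-Knopp normalization to obtain balanced soft assignments (borrowed from SwAV-style prototypical methods) are all assumptions made for illustration, and the alternation between streams and the video encoders themselves are left out. It only shows the idea that the prototype assignment of one view is predicted from every other view, RGB or flow, against a single shared set of prototypes.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def sinkhorn(scores, eps=0.05, n_iters=3):
    """Assumed assignment step: balanced soft assignments of shape
    (batch, prototypes), with each row summing to 1."""
    q = np.exp((scores - scores.max()) / eps).T  # (prototypes, batch)
    q /= q.sum()
    n_protos, batch = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True)  # equalize mass per prototype
        q /= n_protos
        q /= q.sum(axis=0, keepdims=True)  # equalize mass per sample
        q /= batch
    return (q * batch).T

def cross_view_prototype_loss(views, prototypes, temperature=0.1):
    """views: list of (batch, dim) embeddings, e.g. RGB and flow views of
    the same clips. prototypes: (n_protos, dim), one shared set.
    The assignment of view i is predicted from every view j != i."""
    protos = l2_normalize(prototypes)
    scores = [l2_normalize(v) @ protos.T for v in views]
    total, n_terms = 0.0, 0
    for i, s_i in enumerate(scores):
        q_i = sinkhorn(s_i)  # target for view i, treated as a constant
        for j, s_j in enumerate(scores):
            if j == i:
                continue  # skip the view that produced the target
            log_p = log_softmax(s_j / temperature)
            total += -(q_i * log_p).sum(axis=1).mean()
            n_terms += 1
    return total / n_terms

# Hypothetical usage: 2 RGB views and 2 flow views of 8 clips, 5 prototypes.
rng = np.random.default_rng(0)
views = [rng.normal(size=(8, 16)) for _ in range(4)]
prototypes = rng.normal(size=(5, 16))
print(cross_view_prototype_loss(views, prototypes))
```

In a training loop, minimizing a loss of this form with respect to one stream's encoder would pull that stream's embeddings toward the prototypes assigned from the other views, which is one way to read the abstract's claim that motion information is absorbed without needing optical flow at inference.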


Datasets


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Pre-training Dataset | Frozen |
|---|---|---|---|---|---|---|---|
| Self-Supervised Action Recognition | HMDB51 | ViCC (R2+1D; R+F) | Top-1 Accuracy | 61.5 | # 17 | UCF101 | false |
| Self-Supervised Action Recognition | HMDB51 | ViCC (R2+1D; RGB) | Top-1 Accuracy | 52.4 | # 26 | UCF101 | false |
| Self-Supervised Action Recognition | HMDB51 | ViCC (S3D; R+F) | Top-1 Accuracy | 62.2 | # 16 | UCF101 | false |
| Self-supervised Video Retrieval | HMDB51 | ViCC (R2+1D; R+F) | Top-1 | 28.3 | # 2 | UCF101 | - |
| Self-supervised Video Retrieval | HMDB51 | ViCC (S3D; RGB) | Top-1 | 25.5 | # 3 | UCF101 | - |
| Self-supervised Video Retrieval | HMDB51 | ViCC (S3D; R+F) | Top-1 | 29.7 | # 1 | UCF101 | - |
| Self-supervised Video Retrieval | HMDB51 | ViCC (R2+1D; RGB) | Top-1 | 25.3 | # 4 | UCF101 | - |
| Self-Supervised Action Recognition | HMDB51 | ViCC (S3D; RGB) | Top-1 Accuracy | 38.5 | # 29 | UCF101 | true |
| Self-Supervised Action Recognition | HMDB51 (finetuned) | ViCC (S3D; R+F) | Top-1 Accuracy | 62.2 | # 9 | UCF101 | - |
| Self-Supervised Action Recognition | HMDB51 (finetuned) | ViCC (S3D; RGB) | Top-1 Accuracy | 47.9 | # 14 | UCF101 | - |
| Self-Supervised Action Recognition | HMDB51 (finetuned) | ViCC (R2+1D; RGB) | Top-1 Accuracy | 52.4 | # 13 | UCF101 | - |
| Self-supervised Video Retrieval | UCF101 | ViCC (S3D; RGB) | Top-1 | 62.1 | # 2 | UCF101 | - |
| Self-supervised Video Retrieval | UCF101 | ViCC (R2+1D; R+F) | Top-1 | 59.9 | # 3 | UCF101 | - |
| Self-supervised Video Retrieval | UCF101 | ViCC (R2+1D; RGB) | Top-1 | 58.6 | # 4 | UCF101 | - |
| Self-Supervised Action Recognition | UCF101 | ViCC (S3D; R+F) | 3-fold Accuracy | 90.5 | # 14 | UCF101 | false |
| Self-Supervised Action Recognition | UCF101 | ViCC (S3D; RGB) | 3-fold Accuracy | 88.8 | # 15 | UCF101 | false |
| Self-Supervised Action Recognition | UCF101 | ViCC (R2+1D; RGB) | 3-fold Accuracy | 82.8 | # 22 | UCF101 | false |
| Self-Supervised Action Recognition | UCF101 | ViCC (R2+1D; R+F) | 3-fold Accuracy | 88.8 | # 15 | UCF101 | false |
| Self-Supervised Action Recognition | UCF101 | ViCC (S3D; RGB) | 3-fold Accuracy | 72.2 | # 28 | UCF101 | true |
| Self-supervised Video Retrieval | UCF101 | ViCC (S3D; R+F) | Top-1 | 65.1 | # 1 | UCF101 | - |
| Self-Supervised Action Recognition | UCF101 (finetuned) | ViCC (S3D; RGB) | 3-fold Accuracy | 84.3 | # 13 | UCF101 | - |
| Self-Supervised Action Recognition | UCF101 (finetuned) | ViCC (R2+1D; R+F) | 3-fold Accuracy | 88.8 | # 11 | UCF101 | - |
| Self-Supervised Action Recognition | UCF101 (finetuned) | ViCC (S3D; R+F) | 3-fold Accuracy | 90.5 | # 9 | UCF101 | - |
| Self-Supervised Action Recognition | UCF101 (finetuned) | ViCC (R2+1D; RGB) | 3-fold Accuracy | 82.8 | # 14 | UCF101 | - |

Methods