Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full-supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
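The core idea in the abstract, using cluster assignments from one modality as the supervisory signal for the other, can be illustrated with a minimal sketch. This is not the authors' implementation. The random features stand in for encoder outputs, plain k-means stands in for whatever clustering step is used, and the names `kmeans`, `cross_modal_pseudo_labels`, `video_targets`, and `audio_targets` are hypothetical. The encoder training loop that would consume these pseudo-labels is omitted.

```python
import numpy as np


def kmeans(features, k, n_iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per row of `features`."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct samples.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Squared Euclidean distance from every sample to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return labels


def cross_modal_pseudo_labels(video_feats, audio_feats, k):
    """Cluster each modality, then hand its assignments to the other modality."""
    video_clusters = kmeans(video_feats, k, seed=0)
    audio_clusters = kmeans(audio_feats, k, seed=1)
    return {
        "video_targets": audio_clusters,  # video encoder would be trained on audio clusters
        "audio_targets": video_clusters,  # audio encoder would be trained on video clusters
    }


# Toy stand-ins for encoder outputs (assumption): 200 clips,
# 16-d video features and 8-d audio features.
rng = np.random.default_rng(42)
video_feats = rng.normal(size=(200, 16))
audio_feats = rng.normal(size=(200, 8))
targets = cross_modal_pseudo_labels(video_feats, audio_feats, k=5)
```

In a full pipeline, each encoder would presumably be trained as a classifier on the targets derived from the other modality, with clustering and training alternating. The sketch covers only the label-swapping step.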

NeurIPS 2020
| Task | Dataset | Model | Pre-training dataset | Frozen | Metric | Value | Global rank |
|---|---|---|---|---|---|---|---|
| Audio Classification | DCASE | XDC | AudioSet | - | Top-1 Accuracy | 95 | # 3 |
| Audio Classification | DCASE | XDC | IG-Random | - | Top-1 Accuracy | 95 | # 3 |
| Audio Classification | ESC-50 | XDC | IG-Random | - | Top-1 Accuracy | 85.4 | # 19 |
| Audio Classification | ESC-50 | XDC | AudioSet | - | Top-1 Accuracy | 84.8 | # 20 |
| Self-Supervised Action Recognition | HMDB51 | XDC | IG-Random | false | Top-1 Accuracy | 66.5 | # 13 |
| Self-Supervised Action Recognition | HMDB51 | XDC | Kinetics400 | false | Top-1 Accuracy | 52.6 | # 32 |
| Self-Supervised Action Recognition | HMDB51 | XDC | AudioSet | false | Top-1 Accuracy | 63.7 | # 21 |
| Self-Supervised Action Recognition | HMDB51 | XDC | IG-Kinetics | false | Top-1 Accuracy | 68.9 | # 9 |
| Self-Supervised Action Recognition | HMDB51 (finetuned) | XDC | - | - | Top-1 Accuracy | 68.9 | # 4 |
| Self-Supervised Action Recognition | UCF101 (finetuned) | XDC | - | - | 3-fold Accuracy | 95.5 | # 2 |

Results from Other Papers


| Task | Dataset | Model | Metric | Value | Rank |
|---|---|---|---|---|---|
| Self-Supervised Audio Classification | ESC-50 | XDC | Top-1 Accuracy | 85.4 | # 6 |
