Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity

9 Nov 2021 · Pritam Sarkar, Ali Etemad

We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relations. We perform in-depth studies showing that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong, generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on several downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or performs on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound. The code and pretrained models are available on the project website.
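
To make the difference between 'synchronous' and 'asynchronous' audio-visual pairs concrete, the following is a minimal sketch of how such pairs could be sampled from one video. It is not the authors' implementation: the names `Segment`, `sample_pair`, and `max_offset`, and the choice of a uniformly drawn offset, are illustrative assumptions that do not come from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A time window inside a video, in seconds (hypothetical helper type)."""
    start: float
    end: float


def sample_pair(duration, clip_len, max_offset, rng):
    """Sample a (visual, audio) pair of windows, each clip_len seconds long.

    max_offset is an assumed knob for how far the audio window may drift
    from the visual window. max_offset == 0 yields a synchronous pair;
    larger values relax the synchronicity between the two modalities.
    """
    if clip_len > duration:
        raise ValueError("clip_len must not exceed the video duration")
    last_start = duration - clip_len
    v_start = rng.uniform(0.0, last_start)
    # Keep the audio window inside the video and within max_offset of the visual one.
    lo = max(0.0, v_start - max_offset)
    hi = min(last_start, v_start + max_offset)
    a_start = rng.uniform(lo, hi)
    return (Segment(v_start, v_start + clip_len),
            Segment(a_start, a_start + clip_len))


rng = random.Random(0)
sync_visual, sync_audio = sample_pair(10.0, 2.0, 0.0, rng)      # same window
async_visual, async_audio = sample_pair(10.0, 2.0, 3.0, rng)    # audio may drift
```

With `max_offset` set to zero the two windows coincide, which corresponds to the standard synchronous setting; a positive value lets the audio window start earlier or later than the visual one, which is the relaxed setting the abstract refers to.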


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Audio Classification | DCASE | CrissCross (AudioSet) | Top-1 Accuracy | 97 | # 1 |
| | | | Pre-Training Dataset | AudioSet | # 1 |
| Audio Classification | DCASE | CrissCross (Kinetics-400) | Top-1 Accuracy | 96 | # 2 |
| | | | Pre-Training Dataset | Kinetics-400 | # 1 |
| Audio Classification | DCASE | CrissCross (Kinetics-Sound) | Top-1 Accuracy | 93 | # 5 |
| | | | Pre-Training Dataset | Kinetics-Sound | # 1 |
| Self-Supervised Audio Classification | ESC-50 | CrissCross (AudioSet) | Top-1 Accuracy | 90.5 | # 2 |
| Self-Supervised Audio Classification | ESC-50 | CrissCross (Kinetics400) | Top-1 Accuracy | 86.8 | # 4 |
| Self-Supervised Action Recognition | HMDB51 | CrissCross (Kinetics-Sound) | Top-1 Accuracy | 60.5 | # 26 |
| | | | Pre-Training Dataset | Kinetics-Sound | # 1 |
| | | | Frozen | false | # 1 |
| Self-Supervised Video Retrieval | HMDB51 | CrissCross (R2+1D) | Top-1 | 26.4 | # 6 |
| | | | Pretrain | Kinetics400 | # 1 |
| Self-Supervised Action Recognition | HMDB51 | CrissCross (AudioSet) | Top-1 Accuracy | 66.8 | # 11 |
| | | | Pre-Training Dataset | AudioSet | # 1 |
| | | | Frozen | false | # 1 |
| Self-Supervised Action Recognition | HMDB51 | CrissCross (Kinetics400) | Top-1 Accuracy | 64.7 | # 16 |
| | | | Pre-Training Dataset | Kinetics400 | # 1 |
| | | | Frozen | false | # 1 |
| Self-Supervised Action Recognition | UCF101 | CrissCross (Kinetics400) | 3-fold Accuracy | 91.5 | # 18 |
| | | | Pre-Training Dataset | Kinetics400 | # 1 |
| | | | Frozen | false | # 1 |
| Self-Supervised Action Recognition | UCF101 | CrissCross (Kinetics-Sound) | 3-fold Accuracy | 88.3 | # 25 |
| | | | Pre-Training Dataset | Kinetics-Sound | # 1 |
| | | | Frozen | false | # 1 |
| Self-Supervised Video Retrieval | UCF101 | CrissCross (R2+1D) | Top-1 | 63.8 | # 5 |
| | | | Pretrain | Kinetics400 | # 1 |
| Self-Supervised Action Recognition | UCF101 | CrissCross (AudioSet) | 3-fold Accuracy | 92.4 | # 16 |
| | | | Pre-Training Dataset | AudioSet | # 1 |
| | | | Frozen | false | # 1 |
