Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

NeurIPS 2018  ·  Bruno Korbar, Du Tran, Lorenzo Torresani

There is a natural correlation between the visual and auditory elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients for obtaining powerful multi-sensory representations from models optimized to discern the temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve performance superior or comparable to the state of the art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization for video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
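The abstract names a contrastive loss over synchronized (in-sync) and unsynchronized (out-of-sync) audio-video pairs as a key ingredient. Below is a minimal PyTorch sketch of such a margin-based contrastive objective: embeddings of synced pairs are pulled together, while embeddings of unsynced pairs are pushed apart up to a margin. The function name, margin value, and embedding setup are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn.functional as F


def contrastive_sync_loss(video_emb, audio_emb, is_synced, margin=1.0):
    """Margin-based contrastive loss sketch for audio-video synchronization.

    video_emb, audio_emb: (N, D) embeddings from the video and audio subnets.
    is_synced: (N,) float tensor, 1.0 for in-sync pairs, 0.0 for out-of-sync.
    margin: illustrative margin hyperparameter (an assumption, not the paper's value).
    """
    # Euclidean distance between each video/audio embedding pair.
    dist = F.pairwise_distance(video_emb, audio_emb)
    # Synced pairs: penalize any distance (pull embeddings together).
    pos_term = is_synced * dist.pow(2)
    # Unsynced pairs: penalize only distances smaller than the margin.
    neg_term = (1.0 - is_synced) * F.relu(margin - dist).pow(2)
    return (pos_term + neg_term).mean()
```

A careful choice of negatives, as the abstract stresses, would then mean sampling the out-of-sync audio from the same video at a shifted time (hard negatives) rather than only from unrelated videos.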

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Self-Supervised Audio Classification | ESC-50 | AVTS | Top-1 Accuracy | 80.6 | #7 |
| Audio Classification | ESC-50 | AVTS | Top-1 Accuracy | 82.3 | #21 |
| Self-Supervised Action Recognition | HMDB51 (finetuned) | AVTS | Top-1 Accuracy | 61.6 | #10 |
| Self-Supervised Action Recognition | UCF101 (finetuned) | AVTS | 3-fold Accuracy | 89.0 | #10 |
