Videos are a rich source of multimodal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied to video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available.
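
The abstract does not say how deflation is carried out. As a rough illustration only, the sketch below assumes one simple option for a convolutional video network: summing each spatio-temporal filter over its time axis to obtain a 2D filter. Under that assumption, the 2D filter applied to a single image gives the same response as the original 3D filter applied to a static video made of that image repeated. The function names (`conv2d_valid`, `conv3d_valid`, `deflate_kernel`) are hypothetical and are not taken from the released models.

```python
import random

def conv2d_valid(image, kernel):
    """Valid 2D cross-correlation of an H x W image with a kh x kw kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def conv3d_valid(video, kernel):
    """Valid 3D cross-correlation of a T x H x W video with a kt x kh x kw kernel."""
    kt = len(kernel)
    out_t = len(video) - kt + 1
    out = []
    for t in range(out_t):
        # Each temporal slice of the kernel is applied to the matching frame,
        # and the per-slice responses are summed.
        slices = [conv2d_valid(video[t + d], kernel[d]) for d in range(kt)]
        frame = [
            [sum(s[i][j] for s in slices) for j in range(len(slices[0][0]))]
            for i in range(len(slices[0]))
        ]
        out.append(frame)
    return out

def deflate_kernel(kernel):
    """Assumed deflation step: collapse kt x kh x kw to kh x kw by summing over time."""
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    return [
        [sum(kernel[d][a][b] for d in range(kt)) for b in range(kw)]
        for a in range(kh)
    ]

random.seed(0)
image = [[random.random() for _ in range(5)] for _ in range(5)]
kernel3d = [[[random.random() for _ in range(3)] for _ in range(3)]
            for _ in range(3)]

# A static video: the same frame repeated once per temporal tap of the kernel.
static_video = [image] * 3
video_out = conv3d_valid(static_video, kernel3d)[0]
image_out = conv2d_valid(image, deflate_kernel(kernel3d))

# The two responses should agree up to floating-point rounding.
max_diff = max(abs(video_out[i][j] - image_out[i][j])
               for i in range(3) for j in range(3))
```

The equivalence holds here because the convolution is linear and every frame of the static video is identical; layers with temporal padding or normalisation statistics computed on video would need extra care that this sketch does not cover.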

NeurIPS 2020
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Self-Supervised Action Recognition | UCF101 | MMV TSM-50x2 | 3-fold Accuracy | 95.2 | #3 |
| | | | Pre-Training Dataset | Audioset + Howto100M | #1 |
| | | | Frozen | false | #1 |
| Self-Supervised Action Recognition | UCF101 (finetuned) | MMV | 3-fold Accuracy | 91.5 | #7 |

Results from Other Papers

| Task | Dataset | Model | Metric Name | Metric Value | Rank |
|---|---|---|---|---|---|
| Audio Classification | AudioSet | MMV | Test mAP | 0.309 | #19 |
| Self-Supervised Audio Classification | AudioSet (MLP) | MMV | Top-1 Accuracy | 29.7 | #2 |
| Self-Supervised Audio Classification | ESC-50 | MMV | Top-1 Accuracy | 85.6 | #5 |
| Self-Supervised Action Recognition | HMDB51 (finetuned) | MMV | Top-1 Accuracy | 70.1 | #2 |
| Self-Supervised Action Recognition | Kinetics-600 | MMV | Top-1 Accuracy | 55.5 | #5 |