Audiovisual Masked Autoencoders

Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
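The page does not spell out the architecture, but the general recipe is masked autoencoding applied jointly to audio and video. Below is a minimal PyTorch sketch of that idea: both modalities are patchified, a high fraction of tokens is dropped, a joint Transformer encoder sees only the visible tokens, and a light decoder reconstructs the masked patches. All names, layer sizes, the 0.75 masking ratio, and the shared-decoder choice are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal audiovisual masked-autoencoder sketch (illustrative, not the paper's exact model).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudiovisualMAE(nn.Module):
    """Joint encoder over visible audio/video tokens; a shared decoder reconstructs masked patches."""

    def __init__(self, dim=256, patch_dim=768, n_audio=128, n_video=196):
        super().__init__()
        self.audio_embed = nn.Linear(patch_dim, dim)
        self.video_embed = nn.Linear(patch_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, n_audio + n_video, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)
        dec = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, num_layers=2)
        self.to_patch = nn.Linear(dim, patch_dim)

    def forward(self, audio_patches, video_patches, mask_ratio=0.75):
        # Tokenise both modalities and concatenate into one sequence.
        tokens = torch.cat(
            [self.audio_embed(audio_patches), self.video_embed(video_patches)], dim=1
        ) + self.pos
        targets = torch.cat([audio_patches, video_patches], dim=1)
        b, n, d = tokens.shape

        # Randomly keep only a (1 - mask_ratio) fraction of tokens.
        n_keep = int(n * (1 - mask_ratio))
        keep_idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
        visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

        # Encode only the visible tokens (the core masked-autoencoder trick).
        encoded = self.encoder(visible)

        # Scatter encoded tokens back into a full-length sequence of mask tokens.
        full = self.mask_token.expand(b, n, d).clone()
        full.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, d), encoded)
        decoded = self.to_patch(self.decoder(full + self.pos))

        # Reconstruction loss on the masked positions only.
        masked = torch.ones(b, n, dtype=torch.bool, device=tokens.device)
        masked[torch.arange(b).unsqueeze(1), keep_idx] = False
        return F.mse_loss(decoded[masked], targets[masked])


# Toy usage with random spectrogram and frame patches.
model = AudiovisualMAE()
audio = torch.randn(2, 128, 768)   # (batch, audio patches, flattened patch features)
video = torch.randn(2, 196, 768)   # (batch, video patches, flattened patch features)
loss = model(audio, video)
loss.backward()
```

For downstream classification, the decoder would be discarded and the joint encoder fine-tuned on audio-only, video-only, or audiovisual inputs, which is what the single-pretrained-model results below correspond to.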

PDF | ICCV 2023 Abstract

Results from the Paper


 Ranked #1 on Audio Classification on EPIC-KITCHENS-100 (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Audio Classification | AudioSet | Audiovisual Masked Autoencoder (Audiovisual, Single) | Test mAP | 0.518 | #3 |
| Audio Classification | AudioSet | Audiovisual Masked Autoencoder (Audio-only, Single) | Test mAP | 0.466 | #24 |
| Audio Classification | EPIC-KITCHENS-100 | Audiovisual Masked Autoencoder (Audio-only, Single) | Top-1 Verb | 52.7 | #3 |
| Audio Classification | EPIC-KITCHENS-100 | Audiovisual Masked Autoencoder (Audio-only, Single) | Top-1 Noun | 27.2 | #3 |
| Audio Classification | EPIC-KITCHENS-100 | Audiovisual Masked Autoencoder (Audio-only, Single) | Top-1 Action | 19.7 | #3 |
| Audio Classification | EPIC-KITCHENS-100 | Audiovisual Masked Autoencoder (Video-only, Single) | Top-1 Verb | 70.8 | #2 |
| Audio Classification | EPIC-KITCHENS-100 | Audiovisual Masked Autoencoder (Video-only, Single) | Top-1 Noun | 55.9 | #2 |
| Audio Classification | EPIC-KITCHENS-100 | Audiovisual Masked Autoencoder (Video-only, Single) | Top-1 Action | 45.8 | #2 |
| Audio Classification | EPIC-KITCHENS-100 | Audiovisual Masked Autoencoder (Audiovisual, Single) | Top-1 Verb | 71.4 | #1 |
| Audio Classification | EPIC-KITCHENS-100 | Audiovisual Masked Autoencoder (Audiovisual, Single) | Top-1 Noun | 56.4 | #1 |
| Audio Classification | EPIC-KITCHENS-100 | Audiovisual Masked Autoencoder (Audiovisual, Single) | Top-1 Action | 46.0 | #1 |
| Audio Classification | VGGSound | Audiovisual Masked Autoencoder (Audio-only, Single) | Top-1 Accuracy | 57.2 | #11 |
| Audio Classification | VGGSound | Audiovisual Masked Autoencoder (Audiovisual, Single) | Top-1 Accuracy | 65.0 | #7 |

Methods


No methods listed for this paper.