UAVM: Towards Unifying Audio and Visual Models

29 Jul 2022 · Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass

Conventional audio-visual models have independent audio and video branches. In this work, we unify the two branches by designing a Unified Audio-Visual Model (UAVM). UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. We also find several intriguing properties of UAVM that its modality-independent counterparts do not have.
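The core idea of a unified model — small modality-specific input projections feeding a single set of shared weights that serve both modalities — can be sketched as follows. This is a minimal NumPy illustration; the layer sizes, names, and the simple feed-forward structure are illustrative assumptions, not the architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

AUDIO_DIM, VIDEO_DIM = 128, 512   # hypothetical per-modality feature sizes
SHARED_DIM, NUM_CLASSES = 64, 10  # hypothetical shared width / label count

# Modality-specific input projections (the only per-modality parameters).
W_audio = rng.standard_normal((AUDIO_DIM, SHARED_DIM)) * 0.01
W_video = rng.standard_normal((VIDEO_DIM, SHARED_DIM)) * 0.01

# Weights shared by BOTH modalities (the "unified" part).
W_shared = rng.standard_normal((SHARED_DIM, SHARED_DIM)) * 0.01
W_out = rng.standard_normal((SHARED_DIM, NUM_CLASSES)) * 0.01

def classify(features: np.ndarray, modality: str) -> np.ndarray:
    """Project modality-specific features, then run the shared layers."""
    W_in = W_audio if modality == "audio" else W_video
    h = np.maximum(features @ W_in, 0.0)   # modality-specific projection + ReLU
    h = np.maximum(h @ W_shared, 0.0)      # shared hidden layer
    return h @ W_out                       # shared classifier head

audio_logits = classify(rng.standard_normal((2, AUDIO_DIM)), "audio")
video_logits = classify(rng.standard_normal((2, VIDEO_DIM)), "video")
print(audio_logits.shape, video_logits.shape)  # both (2, 10)
```

After the input projection, audio and video representations flow through identical parameters, so a single classifier serves either modality at inference time.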


Results from the Paper


Ranked #2 on Multi-modal Classification on AudioSet (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Multi-modal Classification | AudioSet | UAVM | Average mAP | 0.504 | #2 |
| Audio Classification | AudioSet | UAVM (Audio + Video) | Test mAP | 0.504 | #6 |
| Audio Classification | VGGSound | UAVM (Audio + Video) | Top-1 Accuracy | 65.8 | #6 |
| Audio Classification | VGGSound | UAVM (Video Only) | Top-1 Accuracy | 49.9 | #19 |
| Audio Classification | VGGSound | UAVM (Audio Only) | Top-1 Accuracy | 56.5 | #13 |
| Multi-modal Classification | VGG-Sound | UAVM | Top-1 Accuracy | 65.8 | #3 |
