Attention Bottlenecks for Multimodal Fusion

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality ('late fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and share only what is necessary. We find that such a strategy improves fusion performance while also reducing computational cost. We conduct thorough ablation studies and achieve state-of-the-art results on multiple audio-visual classification benchmarks, including AudioSet, Epic-Kitchens and VGGSound. All code and models will be released.

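To make the bottleneck idea concrete, the sketch below shows one possible form of a fusion layer in PyTorch: each modality's tokens attend only within their own stream plus a small set of shared bottleneck tokens, and the bottleneck updates from the two streams are averaged, so all cross-modal exchange is funnelled through those few latents. This is a minimal illustration assuming standard `nn.TransformerEncoderLayer` blocks as stand-ins for the visual and audio encoders; the class, variable names, and layer sizes are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a "fusion bottleneck" layer (not the official MBT code).
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=768, heads=12, num_bottlenecks=4):
        super().__init__()
        # One encoder layer per modality; weights are not shared across streams.
        self.rgb_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.spec_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.num_bottlenecks = num_bottlenecks

    def forward(self, rgb_tokens, spec_tokens, bottleneck):
        # Each modality attends only to its own tokens plus the shared bottleneck.
        rgb_out = self.rgb_layer(torch.cat([rgb_tokens, bottleneck], dim=1))
        spec_out = self.spec_layer(torch.cat([spec_tokens, bottleneck], dim=1))
        # Split the updated bottleneck tokens back out of each stream.
        n = self.num_bottlenecks
        rgb_tokens, rgb_bn = rgb_out[:, :-n], rgb_out[:, -n:]
        spec_tokens, spec_bn = spec_out[:, :-n], spec_out[:, -n:]
        # Cross-modal exchange happens only through the averaged bottleneck tokens.
        bottleneck = 0.5 * (rgb_bn + spec_bn)
        return rgb_tokens, spec_tokens, bottleneck


# Usage: 196 RGB patch tokens, 128 spectrogram patch tokens, 4 bottleneck tokens.
layer = BottleneckFusionLayer()
rgb = torch.randn(2, 196, 768)
spec = torch.randn(2, 128, 768)
bn = torch.randn(2, 4, 768)
rgb, spec, bn = layer(rgb, spec, bn)
```

Stacking several such layers after a few unimodal-only layers gives a mid-fusion model in which the bottleneck width (here 4 tokens) controls how much information the modalities can exchange.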
NeurIPS 2021
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Audio Classification | AudioSet | MBT (AS-500K training + Video) | Test mAP | 0.496 | #9 |
| Action Recognition | EPIC-KITCHENS-100 | MBT | Action@1 | 43.4 | #21 |
| Action Recognition | EPIC-KITCHENS-100 | MBT | Verb@1 | 64.8 | #23 |
| Action Recognition | EPIC-KITCHENS-100 | MBT | Noun@1 | 58 | #14 |
| Action Classification | Kinetics-400 | MBT (AV) | Acc@1 | 80.8 | #85 |
| Action Classification | Kinetics-400 | MBT (AV) | Acc@5 | 94.6 | #61 |
| Action Classification | Kinetics-Sounds | MBT (AV) | Top 1 Accuracy | 85 | #2 |
| Action Classification | Kinetics-Sounds | MBT (AV) | Top 5 Accuracy | 96.8 | #1 |
| Action Classification | MiT | MBT (AV) | Top 1 Accuracy | 37.3 | #15 |
| Action Classification | MiT | MBT (AV) | Top 5 Accuracy | 61.2 | #10 |
| Audio Classification | VGGSound | MBT (A) | Top 1 Accuracy | 52.3 | #17 |
| Audio Classification | VGGSound | MBT (A) | Top 5 Accuracy | 78.1 | #6 |
| Audio Classification | VGGSound | MBT (V) | Top 1 Accuracy | 51.2 | #18 |
| Audio Classification | VGGSound | MBT (V) | Top 5 Accuracy | 72.6 | #9 |
| Audio Classification | VGGSound | MBT (AV) | Top 5 Accuracy | 85.6 | #2 |
