Multimodal Token Fusion for Vision Transformers

Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources such as images. Intuitively, feeding multiple modalities of data to vision transformers could improve performance, yet the intra-modal attention weights may also be diluted, which could undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of the inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact. Extensive experiments on a variety of homogeneous and heterogeneous modalities demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point clouds and images. Our code is available at https://github.com/yikaiw/TokenFusion.
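The substitution step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository above for that). It assumes two spatially aligned modalities, takes per-token importance scores as given inputs rather than learning them, and uses a hypothetical linear projection between modalities. The function name `fuse_tokens`, the `threshold` value, and all parameter names are illustrative assumptions.

```python
import numpy as np


def fuse_tokens(x_a, x_b, score_a, score_b, proj_b_to_a, proj_a_to_b, threshold=0.02):
    """Replace low-score tokens of each modality with projected tokens of the other.

    x_a, x_b:       (num_tokens, dim) token arrays, assumed spatially aligned.
    score_a/b:      (num_tokens,) importance scores (assumed given, not learned here).
    proj_*:         (dim, dim) hypothetical linear projections between modalities.
    threshold:      arbitrary illustrative cut-off for "uninformative" tokens.
    """
    # A token is kept when its score reaches the threshold.
    keep_a = (score_a >= threshold)[:, None]
    keep_b = (score_b >= threshold)[:, None]
    # Uninformative tokens are swapped for the other modality's projected token
    # at the same position; both outputs are built from the ORIGINAL inputs.
    fused_a = np.where(keep_a, x_a, x_b @ proj_b_to_a)
    fused_b = np.where(keep_b, x_b, x_a @ proj_a_to_b)
    return fused_a, fused_b


# Toy example: 4 tokens of dimension 3, identity projections for readability.
rng = np.random.default_rng(0)
x_rgb = rng.normal(size=(4, 3))
x_depth = rng.normal(size=(4, 3))
s_rgb = np.array([0.9, 0.001, 0.5, 0.7])   # token 1 is treated as uninformative
s_depth = np.array([0.8, 0.6, 0.0, 0.9])   # token 2 is treated as uninformative
eye = np.eye(3)

f_rgb, f_depth = fuse_tokens(x_rgb, x_depth, s_rgb, s_depth, eye, eye)
```

With identity projections, `f_rgb[1]` equals `x_depth[1]` and `f_depth[2]` equals `x_rgb[2]`, while all other tokens are unchanged. The sketch covers only the two-modality case and omits the aggregation over several modalities and the residual positional alignment mentioned in the abstract.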

PDF Abstract journal 2022 PDF

Results from the Paper


 Ranked #1 on Semantic Segmentation on SUN-RGBD (using extra training data)

Task                   Dataset           Model                       Metric Name  Metric Value  Global Rank
Semantic Segmentation  DeLiVER           TokenFusion (RGB-LiDAR)     mIoU         53.01         #6
Semantic Segmentation  DeLiVER           TokenFusion (RGB-Depth)     mIoU         60.25         #3
Semantic Segmentation  DeLiVER           TokenFusion (RGB-Event)     mIoU         45.63         #12
Semantic Segmentation  KITTI-360         TokenFusion (RGB-LiDAR)     mIoU         54.55         #8
Semantic Segmentation  KITTI-360         TokenFusion (RGB-Depth)     mIoU         57.44         #5
Semantic Segmentation  LLRGBD-synthetic  TokenFusion (SegFormer-B2)  mIoU         64.75         #4
Semantic Segmentation  NYU Depth v2      TokenFusion (S)             Mean IoU     54.2%         #18
Semantic Segmentation  NYU Depth v2      TokenFusion (Ti)            Mean IoU     53.3%         #26
3D Object Detection    ScanNetV2         TokenFusion                 mAP@0.25     70.8          #12
                                                                     mAP@0.5      54.2          #14
Semantic Segmentation  SUN-RGBD          TokenFusion (S)             Mean IoU     53.0%         #1
Semantic Segmentation  SUN-RGBD          TokenFusion (Ti)            Mean IoU     51.4%         #6
3D Object Detection    SUN-RGBD val      TokenFusion                 mAP@0.25     64.9          #10
                                                                     mAP@0.5      48.3          #10
