MambaVision: A Hybrid Mamba-Transformer Vision Backbone

CVPR 2025 · Ali Hatamizadeh, Jan Kautz

We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. Our core contribution is a redesign of the Mamba formulation that enhances its capability for efficient modeling of visual features. Through a comprehensive ablation study, we demonstrate the feasibility of integrating Vision Transformers (ViT) with Mamba, and show that equipping the Mamba architecture with self-attention blocks in its final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones. Code: https://github.com/NVlabs/MambaVision
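
The following is a minimal, illustrative PyTorch sketch of the stage layout the abstract describes: a stage whose earlier blocks use a Mamba-style token mixer and whose final blocks are standard self-attention. It is not the official implementation; the module names (SelfAttentionBlock, MambaStyleMixerBlock, HybridStage), the depthwise-convolution stand-in for the selective state-space mixer, and all hyperparameters are placeholders chosen for illustration.

```python
# Illustrative sketch only, not the official MambaVision code.
import torch
import torch.nn as nn


class SelfAttentionBlock(nn.Module):
    """Standard pre-norm multi-head self-attention + MLP block."""
    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class MambaStyleMixerBlock(nn.Module):
    """Placeholder for a Mamba-style sequence mixer; a real implementation
    would use a selective state-space model instead of this depthwise conv."""
    def __init__(self, dim: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C)
        h = self.norm1(x).transpose(1, 2)          # (B, C, N) for Conv1d
        x = x + self.mixer(h).transpose(1, 2)      # mix along the token axis
        return x + self.mlp(self.norm2(x))


class HybridStage(nn.Module):
    """One stage: Mamba-style mixer blocks first, self-attention blocks last."""
    def __init__(self, dim: int, depth: int, num_attn_blocks: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            SelfAttentionBlock(dim) if i >= depth - num_attn_blocks
            else MambaStyleMixerBlock(dim)
            for i in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            x = blk(x)
        return x


# Example: a stage of depth 8 whose final 4 blocks are self-attention.
stage = HybridStage(dim=256, depth=8, num_attn_blocks=4)
tokens = torch.randn(2, 14 * 14, 256)   # (batch, tokens, channels)
print(stage(tokens).shape)              # torch.Size([2, 196, 256])
```

Restricting self-attention to the last blocks keeps most of the stage linear-time in sequence length while still letting the final blocks model global spatial interactions, which is the trade-off the abstract's ablation points to.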


Datasets

ImageNet-1K · MS COCO · ADE20K

Results from the Paper


Image Classification on ImageNet (per-metric leaderboard rank in parentheses):

| Model           | Top-1 Accuracy | Number of params | GFLOPs        |
|-----------------|----------------|------------------|---------------|
| MambaVision-L3  | 88.1% (#58)    |                  | 489.1 (#536)  |
| MambaVision-L2  |                | 241.5M (#989)    |               |
| MambaVision-L   | 85% (#263)     | 227.9M (#987)    | 34.9 (#441)   |
| MambaVision-B   | 84.2% (#332)   | 97.7M (#932)     | 15 (#362)     |
| MambaVision-S   | 83.3% (#435)   | 50.1M (#786)     | 7.5 (#270)    |
| MambaVision-T2  | 82.7% (#506)   | 35.1M (#717)     | 5.1 (#243)    |
| MambaVision-T   | 82.3% (#548)   | 31.8M (#707)     | 4.4 (#216)    |
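
As a side note, parameter counts like those reported above can be reproduced for any PyTorch model by summing its tensor sizes; the snippet below shows the idiom on a small toy module (not an actual MambaVision model).

```python
import torch.nn as nn

# Toy module used only to demonstrate the parameter-count metric; the real
# MambaVision models are available from the repository linked in the abstract.
toy = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))

n_params = sum(p.numel() for p in toy.parameters())
print(f"{n_params / 1e6:.2f}M parameters")  # 0.53M for this toy module
```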

Methods


No methods listed for this paper.