MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features

30 Sep 2022 · Shakti N. Wadekar, Abhishek Chaurasia

MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create lightweight models for mobile vision tasks. Though the main MobileViTv1-block helps to achieve competitive state-of-the-art results, the fusion block inside the MobileViTv1-block creates scaling challenges and has a complex learning task. We propose simple and effective changes to the fusion block to create the MobileViTv3-block, which addresses the scaling challenges and simplifies the learning task. The MobileViTv3-XXS, XS and S models built from our proposed MobileViTv3-block outperform MobileViTv1 on the ImageNet-1K, ADE20K, COCO and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpass MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9% respectively. The recently published MobileViTv2 architecture removes the fusion block and uses linear-complexity transformers to perform better than MobileViTv1. We add our proposed fusion block to MobileViTv2 to create the MobileViTv3-0.5, 0.75 and 1.0 models. These new models are more accurate than MobileViTv2 on the ImageNet-1K, ADE20K, COCO and PascalVOC2012 datasets. MobileViTv3-0.5 and MobileViTv3-0.75 outperform MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0% respectively on the ImageNet-1K dataset. For the segmentation task, MobileViTv3-1.0 achieves 2.07% and 1.1% better mIOU than MobileViTv2-1.0 on the ADE20K and PascalVOC2012 datasets respectively. Our code and the trained models are available at:

Results from the Paper

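Task                  Dataset   Model             Metric Name       Metric Value  Global Rank
Image Classification  ImageNet  MobileViTv3-S     Top 1 Accuracy    79.3%         # 545
Image Classification  ImageNet  MobileViTv3-S     Number of params  5.8 M         # 9
Image Classification  ImageNet  MobileViTv3-S     GFLOPs            1.8           # 129
Image Classification  ImageNet  MobileViTv3-0.5   Top 1 Accuracy    72.33%        # 745
Image Classification  ImageNet  MobileViTv3-0.5   Number of params  1.4 M         # 2
Image Classification  ImageNet  MobileViTv3-0.5   GFLOPs            0.5           # 48
Image Classification  ImageNet  MobileViTv3-1.0   Top 1 Accuracy    78.64%        # 588
Image Classification  ImageNet  MobileViTv3-1.0   Number of params  5.1 M         # 7
Image Classification  ImageNet  MobileViTv3-1.0   GFLOPs            1.9           # 133
Image Classification  ImageNet  MobileViTv3-0.75  Top 1 Accuracy    76.55%        # 666
Image Classification  ImageNet  MobileViTv3-0.75  Number of params  3 M           # 5
Image Classification  ImageNet  MobileViTv3-0.75  GFLOPs            1.1           # 100
Image Classification  ImageNet  MobileViTv3-XXS   Top 1 Accuracy    70.98%        # 758
Image Classification  ImageNet  MobileViTv3-XXS   Number of params  1.2 M         # 1
Image Classification  ImageNet  MobileViTv3-XXS   GFLOPs            0.3           # 24
Image Classification  ImageNet  MobileViTv3-XS    Top 1 Accuracy    76.7%         # 657
Image Classification  ImageNet  MobileViTv3-XS    Number of params  2.5 M         # 4
Image Classification  ImageNet  MobileViTv3-XS    GFLOPs            0.9           # 93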
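
The ImageNet figures listed above trade accuracy against model size and compute. As a rough illustration, the Python sketch below copies those figures into a dictionary and picks the most accurate model that fits a given parameter and GFLOPs budget. The names RESULTS, rank_by and best_under_budget are hypothetical helpers introduced here for illustration; they are not part of the authors' released code.

```python
# ImageNet classification figures copied from the results listing above.
# Keys and field names are illustrative choices, not an official schema.
RESULTS = {
    "MobileViTv3-S":    {"top1": 79.3,  "params_m": 5.8, "gflops": 1.8},
    "MobileViTv3-0.5":  {"top1": 72.33, "params_m": 1.4, "gflops": 0.5},
    "MobileViTv3-1.0":  {"top1": 78.64, "params_m": 5.1, "gflops": 1.9},
    "MobileViTv3-0.75": {"top1": 76.55, "params_m": 3.0, "gflops": 1.1},
    "MobileViTv3-XXS":  {"top1": 70.98, "params_m": 1.2, "gflops": 0.3},
    "MobileViTv3-XS":   {"top1": 76.7,  "params_m": 2.5, "gflops": 0.9},
}


def rank_by(metric, descending=True):
    """Return model names sorted by one of the metrics in RESULTS."""
    return sorted(RESULTS, key=lambda name: RESULTS[name][metric],
                  reverse=descending)


def best_under_budget(max_params_m, max_gflops):
    """Return the most accurate model within both budgets, or None."""
    candidates = [
        name for name, row in RESULTS.items()
        if row["params_m"] <= max_params_m and row["gflops"] <= max_gflops
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda name: RESULTS[name]["top1"])


if __name__ == "__main__":
    # Print models from most to least accurate.
    for name in rank_by("top1"):
        row = RESULTS[name]
        print(f"{name:18s} {row['top1']:6.2f}%  "
              f"{row['params_m']:.1f} M params  {row['gflops']:.1f} GFLOPs")
    # Example budget: at most 3 M parameters and 1 GFLOP.
    print("Best under 3 M params and 1 GFLOP:", best_under_budget(3.0, 1.0))
```

The budget in the usage example is arbitrary; a real deployment target would also need latency measured on the device, which this listing does not report.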