NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training

Designing accurate and efficient vision transformers (ViTs) is an important but challenging task. Supernet-based one-shot neural architecture search (NAS) enables fast architecture optimization and has achieved state-of-the-art (SOTA) results on convolutional neural networks (CNNs). However, directly applying supernet-based NAS to optimize ViTs leads to poor performance, even worse than training single ViTs. In this work, we observe that the poor performance is due to a gradient conflict issue: the gradients of different sub-networks conflict with that of the supernet more severely in ViTs than in CNNs, which leads to early saturation in training and inferior convergence. To alleviate this issue, we propose a series of techniques, including a gradient projection algorithm, a switchable layer scaling design, and a simplified data augmentation and regularization training recipe. The proposed techniques significantly improve the convergence and the performance of all sub-networks. Our discovered hybrid ViT model family, dubbed NASViT, achieves top-1 accuracies from 78.2% to 81.8% on ImageNet at 200M to 800M FLOPs, outperforming all prior CNNs and ViTs, including AlphaNet and LeViT. When transferred to semantic segmentation, NASViTs also outperform previous backbones on the Cityscapes and ADE20K datasets, achieving 73.2% and 37.9% mIoU, respectively, with only 5G FLOPs.


Results from the Paper


Image Classification on ImageNet:

| Model             | Top-1 Accuracy (Rank) | GFLOPs (Rank) |
|-------------------|-----------------------|---------------|
| NASViT-A0         | 78.2% (#784)          | 0.208 (#15)   |
| NASViT-A1         | 79.7% (#691)          | 0.309 (#29)   |
| NASViT-A2         | 80.5% (#644)          | 0.421 (#46)   |
| NASViT-A3         | 81.0% (#620)          | 0.528 (#55)   |
| NASViT-A4         | 81.4% (#592)          | 0.591 (#62)   |
| NASViT-A5         | 81.8% (#559)          | 0.757 (#91)   |
| NASViT (supernet) | 82.9% (#450)          | 1.881 (#144)  |

Neural Architecture Search on ImageNet:

| Model     | Top-1 Error Rate (Rank) | Accuracy (Rank) | FLOPs (Rank) |
|-----------|-------------------------|-----------------|--------------|
| NASViT-A0 | 21.8 (#51)              | 78.2 (#40)      | 208M (#111)  |
| NASViT-A1 | 20.3 (#27)              | 79.7 (#21)      | 309M (#114)  |
| NASViT-A2 | 19.5 (#15)              | 80.5 (#11)      | 421M (#120)  |
| NASViT-A3 | 19.0 (#11)              | 81.0 (#8)       | 528M (#125)  |
| NASViT-A4 | 18.6 (#10)              | 81.4 (#7)       | 591M (#128)  |
| NASViT-A5 | 18.2 (#7)               | 81.8 (#5)       | 757M (#135)  |

Methods


No methods listed for this paper.