TinyViT: Fast Pretraining Distillation for Small Vision Transformers

21 Jul 2022  ·  Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu, Lu Yuan

Vision transformers (ViTs) have recently drawn great attention in computer vision due to their remarkable model capability. However, most prevailing ViT models suffer from a huge number of parameters, restricting their applicability on devices with limited resources. To alleviate this issue, we propose TinyViT, a new family of tiny and efficient vision transformers pretrained on large-scale datasets with our proposed fast distillation framework. The central idea is to transfer knowledge from large pretrained models to small ones, while enabling small models to get the dividends of massive pretraining data. More specifically, we apply distillation during pretraining for knowledge transfer. The logits of large teacher models are sparsified and stored on disk in advance to save memory cost and computation overhead. The tiny student transformers are automatically scaled down from a large pretrained model under computation and parameter constraints. Comprehensive experiments demonstrate the efficacy of TinyViT. It achieves a top-1 accuracy of 84.8% on ImageNet-1k with only 21M parameters, comparable to Swin-B pretrained on ImageNet-21k while using 4.2 times fewer parameters. Moreover, with increased image resolution, TinyViT reaches 86.5% accuracy, slightly better than Swin-L while using only 11% of the parameters. Last but not least, we demonstrate the good transfer ability of TinyViT on various downstream tasks. Code and models are available at https://github.com/microsoft/Cream/tree/main/TinyViT.
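To make the idea of sparsified, pre-stored teacher logits concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`sparsify`, `densify`, `distill_loss`), the top-k selection rule, and the choice to spread the leftover probability mass evenly over the unstored classes are all assumptions made here. The abstract says only that the logits are sparsified and stored in advance.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparsify(teacher_logits, k):
    """Keep only the k largest teacher probabilities as (class_index, prob) pairs.

    Assumption: top-k selection is one plausible way to sparsify the logits.
    """
    probs = softmax(teacher_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]


def densify(sparse, num_classes):
    """Rebuild a full soft label from the stored pairs.

    Assumption: the probability mass that was not stored is spread
    uniformly over the remaining classes.
    """
    kept = dict(sparse)
    missing = num_classes - len(kept)
    rest = (1.0 - sum(kept.values())) / missing if missing > 0 else 0.0
    return [kept.get(i, rest) for i in range(num_classes)]


def distill_loss(student_logits, soft_label):
    """Cross-entropy between the teacher soft label and the student prediction."""
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in student_logits))
    return -sum(p * (s - log_z) for p, s in zip(soft_label, student_logits))


# Offline step: run the teacher once and store only k entries per image.
teacher_logits = [4.0, 1.0, 0.5, -1.0, -2.0, 0.0]
stored = sparsify(teacher_logits, k=2)

# Training step: rebuild the soft label and compare it with the student output.
soft_label = densify(stored, num_classes=len(teacher_logits))
loss = distill_loss([3.0, 0.5, 0.0, -0.5, -1.0, 0.2], soft_label)
```

In this sketch the teacher never runs during student training; only `stored` is read back, which is where the memory and compute savings mentioned in the abstract would come from.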


Results from the Paper


Task: Image Classification. Dataset: ImageNet. The global rank for each metric is given in parentheses.

Model                                      Top 1 Accuracy    Number of params    GFLOPs
TinyViT-5M-distill (21k)                   80.7% (#644)      5.4M (#426)         1.3 (#120)
TinyViT-11M-distill (21k)                  83.2% (#421)      11M (#492)          2.0 (#149)
TinyViT-21M-distill (21k)                  84.8% (#273)      21M (#553)          4.3 (#206)
TinyViT-21M-384-distill (384 res, 21k)     86.2% (#165)      21M (#553)          13.8 (#340)
TinyViT-21M-512-distill (512 res, 21k)     86.5% (#136)      21M (#553)          27.0 (#397)
TinyViT-5M                                 79.1% (#730)      5.4M (#426)         1.3 (#120)
TinyViT-11M                                81.5% (#590)      11M (#492)          2.0 (#149)
TinyViT-21M                                83.1% (#434)      21M (#553)          4.3 (#206)
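As a quick reading aid for the results above, the short sketch below computes the top-1 accuracy difference between each "-distill (21k)" entry and the entry of the same size without that suffix. The numbers are copied from the listing; treating the unsuffixed rows as the matching non-distilled baselines is an assumption made here, since the listing does not describe how they were trained.

```python
# Top-1 accuracy (%) copied from the results listing: (baseline, distill 21k).
# Assumption: rows without the "-distill (21k)" suffix are the matching baselines.
results = {
    "TinyViT-5M": (79.1, 80.7),
    "TinyViT-11M": (81.5, 83.2),
    "TinyViT-21M": (83.1, 84.8),
}

# Accuracy difference in percentage points, rounded to one decimal place.
gains = {
    name: round(distilled - baseline, 1)
    for name, (baseline, distilled) in results.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain} points top-1")
```

At the same parameter count and GFLOPs, each distilled entry is listed between 1.6 and 1.7 points higher than its counterpart.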
