Going deeper with Image Transformers

Transformers have recently been adapted for large-scale image classification, achieving high scores that shake up the long-standing supremacy of convolutional neural networks. However, the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. In particular, we investigate the interplay of architecture and optimization in such dedicated transformers. We make two changes to the transformer architecture that significantly improve the accuracy of deep transformers. As a result, we produce models whose performance does not saturate early with increasing depth: for instance, we obtain 86.5% top-1 accuracy on ImageNet when training with no external data, attaining the current state of the art with fewer FLOPs and parameters. Moreover, our best model establishes a new state of the art on ImageNet with Reassessed labels and on ImageNet-V2 / matched frequency, in the setting with no additional training data. We share our code and models.
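The two architecture changes the abstract refers to are LayerScale (a learnable per-channel scaling of each residual branch, initialized near zero so that very deep stacks remain trainable) and class-attention layers (a class token inserted in the final layers that attends to the patch tokens). A minimal numpy sketch of the LayerScale residual update, with hypothetical toy shapes and a stand-in block function:

```python
import numpy as np

def layer_scale_residual(x, block_fn, gamma):
    """One residual update with LayerScale: the sub-block output is
    scaled per channel by a learnable vector gamma before the skip add."""
    return x + gamma * block_fn(x)

# Toy shapes (hypothetical): 2 tokens, 4 channels.
d = 4
# The paper initializes gamma to a small depth-dependent constant
# (on the order of 1e-1 down to 1e-6), so each layer starts close
# to the identity mapping.
gamma = np.full(d, 1e-4)
x = np.ones((2, d))
block = lambda t: 3.0 * t  # stand-in for an attention or MLP sub-block
out = layer_scale_residual(x, block, gamma)
print(out[0, 0])  # ~1.0003: a barely perturbed identity at initialization
```

In a real model `gamma` would be a trained parameter per layer; the point of the sketch is only that a near-zero initialization makes each layer's contribution tiny at the start of training.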

ICCV 2021

Results from the Paper


Ranked #5 on Image Classification on CIFAR-10 (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank | Uses Extra Training Data |
|------|---------|-------|--------|-------|-------------|--------------------------|
| Image Classification | CIFAR-10 | CaiT-M-36 U 224 | Percentage correct | 99.4 | #5 | Yes |
| Image Classification | CIFAR-100 | CaiT-M-36 U 224 | Percentage correct | 93.1 | #11 | Yes |
| Image Classification | Flowers-102 | CaiT-M-36 U 224 | Accuracy | 99.1 | #13 | Yes |
| Image Classification | ImageNet | CaiT-XXS-36 | Top-1 Accuracy | 82.2% | #510 | No |
| | | | Number of params | 17.3M | #522 | |
| | | | GFLOPs | 14.3 | #335 | |
| Image Classification | ImageNet | CaiT-XS-24 | Top-1 Accuracy | 84.1% | #325 | No |
| | | | Number of params | 26.6M | #613 | |
| | | | GFLOPs | 19.3 | #364 | |
| Image Classification | ImageNet | CaiT-XS-36 | Top-1 Accuracy | 84.8% | #270 | No |
| | | | Number of params | 38.6M | #665 | |
| | | | GFLOPs | 28.8 | #389 | |
| Image Classification | ImageNet | CaiT-S-24 | Top-1 Accuracy | 85.1% | #245 | No |
| | | | Number of params | 46.9M | #710 | |
| | | | GFLOPs | 32.2 | #398 | |
| Image Classification | ImageNet | CaiT-S-36 | Top-1 Accuracy | 85.4% | #221 | No |
| | | | Number of params | 68.2M | #786 | |
| | | | GFLOPs | 48 | #421 | |
| Image Classification | ImageNet | CaiT-M-24 | Top-1 Accuracy | 85.8% | #187 | No |
| | | | Number of params | 185.9M | #887 | |
| | | | GFLOPs | 116.1 | #458 | |
| Image Classification | ImageNet | CaiT-M-36 | Top-1 Accuracy | 86.1% | #170 | No |
| | | | Number of params | 270.9M | #908 | |
| | | | GFLOPs | 173.3 | #464 | |
| Image Classification | ImageNet | CaiT-M-36-448 | Top-1 Accuracy | 86.3% | #153 | No |
| | | | Number of params | 271M | #909 | |
| | | | GFLOPs | 247.8 | #472 | |
| Image Classification | ImageNet | CaiT-S-48 | Top-1 Accuracy | 85.3% | #231 | No |
| | | | Number of params | 89.5M | #845 | |
| | | | GFLOPs | 63.8 | #435 | |
| Image Classification | ImageNet | CaiT-XXS-24 | Top-1 Accuracy | 80.9% | #618 | No |
| | | | Number of params | 12M | #496 | |
| | | | GFLOPs | 9.6 | #292 | |
| Image Classification | ImageNet | CaiT-M-48-448 | Top-1 Accuracy | 86.5% | #135 | No |
| | | | Number of params | 438M | #930 | |
| | | | GFLOPs | 377.3 | #480 | |
| Image Classification | ImageNet ReaL | CaiT-M-36-448 | Accuracy | 90.2% | #19 | No |
| Image Classification | ImageNet V2 | CaiT-M-36-448 | Top-1 Accuracy | 76.7 | #16 | No |
| Image Classification | iNaturalist 2018 | CaiT-M-36 U 224 | Top-1 Accuracy | 78% | #18 | Yes |
| Image Classification | iNaturalist 2019 | CaiT-M-36 U 224 | Top-1 Accuracy | 81.8 | #7 | Yes |
| Image Classification | Stanford Cars | CaiT-M-36 U 224 | Accuracy | 94.2 | #5 | Yes |

Methods