Going deeper with Image Transformers

Transformers have recently been adapted for large-scale image classification, achieving high scores that shake up the long supremacy of convolutional neural networks. However, the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. In particular, we investigate the interplay of architecture and optimization of such dedicated transformers. We make two changes to the transformer architecture that significantly improve the accuracy of deep transformers. This leads us to produce models whose performance does not saturate early with more depth: for instance, we obtain 86.5% top-1 accuracy on ImageNet when training with no external data, thus attaining the current state of the art with fewer FLOPs and parameters. Moreover, our best model establishes the new state of the art on ImageNet with Reassessed labels and ImageNet-V2 / match frequency, in the setting with no additional training data. We share our code and models.

ICCV 2021

Results from the Paper


 Ranked #1 on Image Classification on CIFAR-10 (using extra training data)

Task                  Dataset           Model            Metric Name                  Metric Value  Global Rank
Image Classification  CIFAR-10          CaiT-M-36 U 224  Percentage correct           99.4          # 1
Image Classification  CIFAR-100         CaiT-M-36 U 224  Percentage correct           93.1          # 7
Image Classification  Flowers-102       CaiT-M-36 U 224  Accuracy                     99.1          # 10
Image Classification  ImageNet          CAIT-XS-24       Top 1 Accuracy               84.1%         # 166
Image Classification  ImageNet          CAIT-XS-24       Number of params             26.6M         # 220
Image Classification  ImageNet          CAIT-M36-448     Top 1 Accuracy               86.3%         # 72
Image Classification  ImageNet          CAIT-M36-448     Number of params             271M          # 47
Image Classification  ImageNet          CAIT-XXS-24      Top 1 Accuracy               80.9%         # 344
Image Classification  ImageNet          CAIT-XXS-24      Number of params             12M           # 282
Image Classification  ImageNet          CAIT-S-48        Top 1 Accuracy               85.3%         # 115
Image Classification  ImageNet          CAIT-S-48        Number of params             89.5M         # 82
Image Classification  ImageNet          CAIT-S-36        Top 1 Accuracy               85.4%         # 109
Image Classification  ImageNet          CAIT-S-36        Number of params             68.2M         # 122
Image Classification  ImageNet          CAIT-S-24        Top 1 Accuracy               85.1%         # 124
Image Classification  ImageNet          CAIT-S-24        Number of params             46.9M         # 169
Image Classification  ImageNet          CAIT-XS-36       Top 1 Accuracy               84.8%         # 138
Image Classification  ImageNet          CAIT-XS-36       Number of params             38.6M         # 195
Image Classification  ImageNet          CAIT-XXS-36      Top 1 Accuracy               82.2%         # 280
Image Classification  ImageNet          CAIT-XXS-36      Number of params             17.3M         # 274
Image Classification  ImageNet          CaiT-M-48-448    Top 1 Accuracy               86.5%         # 64
Image Classification  ImageNet          CaiT-M-48-448    Number of params             356M          # 38
Image Classification  ImageNet          CaiT-M-48-448    Hardware Burden              None          # 1
Image Classification  ImageNet          CaiT-M-48-448    Operations per network pass  None          # 1
Image Classification  ImageNet          CAIT-M-36        Top 1 Accuracy               86.1%         # 79
Image Classification  ImageNet          CAIT-M-36        Number of params             270.9M        # 48
Image Classification  ImageNet          CAIT-M-24        Top 1 Accuracy               85.8%         # 91
Image Classification  ImageNet          CAIT-M-24        Number of params             185.9M        # 63
Image Classification  ImageNet ReaL     CAIT-M36-448     Accuracy                     90.2%         # 17
Image Classification  ImageNet V2       CAIT-M36-448     Top 1 Accuracy               76.7          # 7
Image Classification  iNaturalist 2018  CaiT-M-36 U 224  Top-1 Accuracy               78%           # 11
Image Classification  iNaturalist 2019  CaiT-M-36 U 224  Top-1 Accuracy               81.8          # 4
Image Classification  Stanford Cars     CaiT-M-36 U 224  Accuracy                     94.2          # 4

Methods