Training data-efficient image transformers & distillation through attention

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce competitive convolution-free transformers by training on ImageNet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the benefit of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets both on ImageNet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
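
The teacher-student strategy described above can be illustrated with a small sketch. The abstract does not give the loss itself, so the following assumes a hard-label variant: the student has two output heads, one fed by the class token and one by the distillation token, and the distillation head is trained against the teacher's predicted label. The function names and the equal 0.5/0.5 weighting are illustrative assumptions, not the authors' released code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def cross_entropy(logits, target):
    """Cross-entropy of one sample against an integer class index."""
    return -log_softmax(logits)[target]


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Hypothetical hard-label distillation objective (assumed form).

    cls_logits:     student output from the class-token head
    dist_logits:    student output from the distillation-token head
    teacher_logits: output of the frozen teacher (e.g. a convnet)
    label:          ground-truth class index
    """
    # The teacher's hard decision acts as a second label.
    teacher_label = max(range(len(teacher_logits)),
                        key=teacher_logits.__getitem__)
    loss_cls = cross_entropy(cls_logits, label)           # true label
    loss_dist = cross_entropy(dist_logits, teacher_label)  # teacher label
    # Equal weighting is an assumption made for this sketch.
    return 0.5 * loss_cls + 0.5 * loss_dist


# Toy example with two classes: the true label is 0, the teacher predicts 1.
loss = hard_distillation_loss(
    cls_logits=[5.0, 0.0],
    dist_logits=[0.0, 5.0],
    teacher_logits=[0.0, 3.0],
    label=0,
)
```

In this toy call each head agrees with its own target, so the loss is small. With uninformative (all-zero) student logits, both terms reduce to the log of the number of classes.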

Task                              | Dataset            | Model      | Metric                      | Value  | Global Rank
----------------------------------+--------------------+------------+-----------------------------+--------+------------
Image Classification              | CIFAR-10           | DeiT-B     | Percentage correct          | 99.1   | #8
                                  |                    |            | PARAMS                      | 86M    | #208
Image Classification              | CIFAR-100          | DeiT-B     | Percentage correct          | 90.8   | #21
                                  |                    |            | PARAMS                      | 86M    | #168
Image Classification              | Flowers-102        | DeiT-B     | Accuracy                    | 98.8%  | #16
                                  |                    |            | PARAMS                      | 86M    | #47
Image Classification              | ImageNet           | DeiT-Ti    | Top 1 Accuracy              | 76.6%  | #560
                                  |                    |            | Number of params            | 5M     | #257
Image Classification              | ImageNet           | DeiT-B 384 | Top 1 Accuracy              | 85.2%  | #130
                                  |                    |            | Number of params            | 87M    | #564
                                  |                    |            | Hardware Burden             | None   | #1
                                  |                    |            | Operations per network pass | None   | #1
Image Classification              | ImageNet           | DeiT-S     | Top 1 Accuracy              | 82.6%  | #286
                                  |                    |            | Number of params            | 22M    | #377
Image Classification              | ImageNet           | DeiT-B     | Top 1 Accuracy              | 84.2%  | #173
                                  |                    |            | Number of params            | 86M    | #558
Image Classification              | ImageNet ReaL      | DeiT-Ti    | Accuracy                    | 82.1%  | #49
                                  |                    |            | Params                      | 5M     | #34
Image Classification              | ImageNet ReaL      | DeiT-B     | Accuracy                    | 88.7%  | #25
                                  |                    |            | Params                      | 86M    | #40
Image Classification              | ImageNet ReaL      | DeiT-B-384 | Accuracy                    | 89.3%  | #23
                                  |                    |            | Params                      | 86M    | #40
Image Classification              | ImageNet ReaL      | DeiT-S     | Accuracy                    | 86.8%  | #35
                                  |                    |            | Params                      | 22M    | #36
Image Classification              | iNaturalist 2018   | DeiT-B     | Top-1 Accuracy              | 79.5%  | #10
Fine-Grained Image Classification | Oxford 102 Flowers | DeiT-B     | Accuracy                    | 98.8%  | #8
                                  |                    |            | PARAMS                      | 86M    | #22
Document Layout Analysis          | PubLayNet val      | DeiT-B     | Text                        | 0.934  | #3
                                  |                    |            | Title                       | 0.874  | #3
                                  |                    |            | List                        | 0.921  | #5
                                  |                    |            | Table                       | 0.972  | #5
                                  |                    |            | Figure                      | 0.957  | #4
                                  |                    |            | Overall                     | 0.932  | #4
Document Image Classification     | RVL-CDIP           | DeiT-B     | Accuracy                    | 90.32% | #23
                                  |                    |            | Parameters                  | 87M    | #12
Fine-Grained Image Classification | Stanford Cars      | DeiT-B     | Accuracy                    | 93.3%  | #43
                                  |                    |            | PARAMS                      | 86M    | #58

Methods