Training data-efficient image transformers & distillation through attention

Recently, neural networks based purely on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained on hundreds of millions of images using expensive infrastructure, which limits their adoption. In this work, we produce a competitive convolution-free transformer by training on ImageNet only. We train it on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves a top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token that ensures the student learns from the teacher through attention. We show the benefit of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets both on ImageNet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
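
The abstract does not spell out the training objective, so the following is only a rough sketch of how a token-based distillation loss could look. It assumes a hard-label variant: the student has two output heads, one fed by the class token and one fed by the distillation token, and the teacher's predicted class (its argmax) serves as the target for the second head. The function names and the equal 0.5/0.5 weighting are illustrative assumptions, not taken from the text above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class index."""
    return -math.log(softmax(logits)[target])


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Hypothetical hard-label distillation objective (an assumption).

    cls_logits:     student output read from the class token
    dist_logits:    student output read from the distillation token
    teacher_logits: output of the (frozen) teacher, e.g. a convnet
    label:          ground-truth class index
    """
    # The teacher's most likely class acts as a second "label".
    teacher_label = max(range(len(teacher_logits)),
                        key=teacher_logits.__getitem__)
    # Assumed equal weighting of the two cross-entropy terms.
    return (0.5 * cross_entropy(cls_logits, label)
            + 0.5 * cross_entropy(dist_logits, teacher_label))


# Toy usage with two classes: the true label is 0, the teacher prefers 1.
loss = hard_distillation_loss(
    cls_logits=[2.0, 0.0],
    dist_logits=[0.0, 2.0],
    teacher_logits=[0.1, 0.9],
    label=0,
)
```

In this sketch the two heads are pulled toward different targets whenever the teacher disagrees with the ground truth. The distillation token is simply another input to the self-attention layers alongside the class token and the patch tokens, so it interacts with them through attention.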

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Image Classification | CIFAR-10 | DeiT-B | Percentage correct | 99.1 | #12 |
| | | | PARAMS | 86M | #234 |
| Image Classification | CIFAR-100 | DeiT-B | Percentage correct | 90.8 | #23 |
| | | | PARAMS | 86M | #198 |
| Image Classification | Flowers-102 | DeiT-B | Accuracy | 98.8% | #17 |
| | | | PARAMS | 86M | #50 |
| Image Classification | ImageNet | DeiT-B | Top 1 Accuracy | 82.6% | #474 |
| | | | Number of params | 22M | #557 |
| | | | Top 1 Accuracy | 84.2% | #313 |
| | | | Number of params | 86M | #814 |
| Image Classification | ImageNet | DeiT-B 384 | Top 1 Accuracy | 85.2% | #239 |
| | | | Number of params | 87M | #822 |
| | | | Hardware Burden | None | #1 |
| | | | Operations per network pass | None | #1 |
| Image Classification | ImageNet | DeiT-B | Top 1 Accuracy | 76.6% | #839 |
| | | | Number of params | 5M | #403 |
| Efficient ViTs | ImageNet-1K (with DeiT-S) | Base (DeiT-S) | Top 1 Accuracy | 79.8 | #4 |
| | | | GFLOPs | 4.6 | #41 |
| Efficient ViTs | ImageNet-1K (with DeiT-T) | Base (DeiT-T) | Top 1 Accuracy | 72.2 | #6 |
| | | | GFLOPs | 1.2 | #22 |
| Image Classification | ImageNet ReaL | DeiT-Ti | Accuracy | 82.1% | #49 |
| | | | Params | 5M | #37 |
| Image Classification | ImageNet ReaL | DeiT-B | Accuracy | 88.7% | #26 |
| | | | Params | 86M | #43 |
| Image Classification | ImageNet ReaL | DeiT-B-384 | Accuracy | 89.3% | #24 |
| | | | Params | 86M | #43 |
| Image Classification | ImageNet ReaL | DeiT-S | Accuracy | 86.8% | #36 |
| | | | Params | 22M | #39 |
| Image Classification | iNaturalist 2018 | DeiT-B | Top-1 Accuracy | 79.5% | #15 |
| Fine-Grained Image Classification | Oxford 102 Flowers | DeiT-B | Accuracy | 98.8% | #11 |
| | | | PARAMS | 86M | #26 |
| Document Layout Analysis | PubLayNet val | DeiT-B | Text | 0.934 | #8 |
| | | | Title | 0.874 | #8 |
| | | | List | 0.921 | #10 |
| | | | Table | 0.972 | #11 |
| | | | Figure | 0.957 | #9 |
| | | | Overall | 0.932 | #9 |
| Fine-Grained Image Classification | Stanford Cars | DeiT-B | Accuracy | 93.3% | #56 |
| | | | PARAMS | 86M | #73 |

Results from Other Papers
| Task | Dataset | Model | Metric Name | Metric Value | Rank |
|---|---|---|---|---|---|
| Document Image Classification | RVL-CDIP | DeiT-B | Accuracy | 90.32% | #29 |
| | | | Parameters | 87M | #15 |

Methods