Escaping the Big Data Paradigm with Compact Transformers

12 Apr 2021  ·  Ali Hassani, Steven Walton, Nikhil Shah, Abulikemu Abuduweili, Jiachen Li, Humphrey Shi ·

With the rise of Transformers as the standard for language processing, and their advancements in computer vision, there has been a corresponding growth in parameter size and amounts of training data. Many have come to believe that, as a result, transformers are not suitable for small datasets. This trend raises concerns such as the limited availability of data in certain scientific domains and the exclusion of those with limited resources from research in the field. In this paper, we present an approach for small-scale learning by introducing Compact Transformers. We show for the first time that, with the right size and convolutional tokenization, transformers can avoid overfitting and outperform state-of-the-art CNNs on small datasets. Our models are flexible in size and can have as few as 0.28M parameters while achieving competitive results. Our best model reaches 98% accuracy when trained from scratch on CIFAR-10 with only 3.7M parameters, a significant improvement in data efficiency over previous Transformer-based models: it is over 10x smaller than other transformers and 15% the size of ResNet50 while achieving similar performance. CCT also outperforms many modern CNN-based approaches, and even some recent NAS-based approaches. Additionally, we obtain a new SOTA result on Flowers-102 with 99.76% top-1 accuracy, and improve upon the existing baseline on ImageNet (82.71% accuracy with 29% as many parameters as ViT), as well as on NLP tasks. Our simple and compact design makes transformers more feasible to study for those with limited computing resources and/or small datasets, while extending existing research efforts in data-efficient transformers. Our code and pre-trained models are publicly available.
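The model names below follow the paper's CCT-L/KxC convention: L transformer encoder layers, a KxK convolutional tokenizer kernel, and C convolutional blocks. As a minimal sketch of why convolutional tokenization changes the sequence length compared with ViT-style patching, the helper below computes the number of tokens produced by C conv blocks (each a same-padded conv followed by a 3x3 max-pool with stride 2, which is the default configuration described in the paper); the function names and exact pool settings here are illustrative assumptions, not the authors' code:

```python
def conv_out(n: int, kernel: int, stride: int, pad: int) -> int:
    """Spatial output size of a conv/pool layer (standard formula)."""
    return (n + 2 * pad - kernel) // stride + 1

def cct_tokens(img: int, kernel: int, n_conv: int) -> int:
    """Token count after the convolutional tokenizer (illustrative sketch).

    Each block: same-padded KxK conv (stride 1), then 3x3 max-pool,
    stride 2, padding 1, which roughly halves each spatial dimension.
    """
    n = img
    for _ in range(n_conv):
        n = conv_out(n, kernel, 1, kernel // 2)  # conv keeps spatial size
        n = conv_out(n, 3, 2, 1)                 # pool halves it
    return n * n

# CCT-7/3x1 on 32x32 CIFAR-10: one 3x3 conv block -> 16x16 = 256 tokens
print(cct_tokens(32, 3, 1))    # 256
# CCT-14/7x2 on 224x224 ImageNet: two 7x7 conv blocks -> 56x56 = 3136 tokens
print(cct_tokens(224, 7, 2))   # 3136
```

Unlike ViT's fixed non-overlapping 16x16 patches (196 tokens at 224x224), the overlapping conv tokenizer lets the sequence length and inductive bias be tuned to the dataset size.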


Results from the Paper

 Ranked #1 on Image Classification on Flowers-102 (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Image Classification | CIFAR-10 | CCT-6/3x1 | Percentage correct | 95.29 | #126 |
| Image Classification | CIFAR-10 | CCT-6/3x1 | Params | 3.17M | #190 |
| Image Classification | CIFAR-10 | CCT-7/3x1* | Percentage correct | 98 | #52 |
| Image Classification | CIFAR-10 | CCT-7/3x1* | Params | 3.76M | #192 |
| Image Classification | CIFAR-100 | CCT-6/3x1 | Percentage correct | 77.31 | #136 |
| Image Classification | CIFAR-100 | CCT-6/3x1 | Params | 3.17M | #183 |
| Image Classification | CIFAR-100 | CCT-7/3x1* | Percentage correct | 82.72 | #97 |
| Image Classification | Flowers-102 | CCT-14/7x2 | Accuracy | 99.76 | #1 |
| Image Classification | ImageNet | CCT-16/7x2 | Top 1 Accuracy | 80.28% | #653 |
| Image Classification | ImageNet | CCT-14/7x2 \| 384 | Top 1 Accuracy | 82.71% | #465 |
| Image Classification | ImageNet | CCT-14/7x2 | Top 1 Accuracy | 81.34% | #593 |
| Image Classification | ImageNet | CCT-14/7x2 | Number of params | 22.36M | #568 |
| Image Classification | ImageNet | CCT-14/7x2 | GFLOPs | 11.06 | #306 |
| Fine-Grained Image Classification | Oxford 102 Flowers | CCT-14/7x2 | FLOPs | 15G | #3 |
| Fine-Grained Image Classification | Oxford 102 Flowers | CCT-14/7x2 | Params | 22.5M | #23 |