Incorporating Convolution Designs into Visual Transformers

Motivated by the success of Transformers in natural language processing (NLP) tasks, several attempts (e.g., ViT and DeiT) have emerged to apply Transformers to the vision domain. However, pure Transformer architectures often require a large amount of training data or extra supervision to achieve performance comparable to convolutional neural networks (CNNs). To overcome these limitations, we analyze the potential drawbacks of directly borrowing Transformer architectures from NLP. We then propose a new \textbf{Convolution-enhanced image Transformer (CeiT)} that combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies. Three modifications are made to the original Transformer: \textbf{1)} instead of the straightforward tokenization from raw input images, we design an \textbf{Image-to-Tokens (I2T)} module that extracts patches from generated low-level features; \textbf{2)} the feed-forward network in each encoder block is replaced with a \textbf{Locally-enhanced Feed-Forward (LeFF)} layer that promotes the correlation among neighboring tokens in the spatial dimension; \textbf{3)} a \textbf{Layer-wise Class token Attention (LCA)} is attached at the top of the Transformer to utilize the multi-level representations. Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data or extra CNN teachers. CeiT models also demonstrate better convergence, with $3\times$ fewer training iterations, which can significantly reduce the training cost\footnote{Code and models will be released upon acceptance.}.
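To make the second modification more concrete, the following is a minimal sketch of the spatial step that a LeFF-style layer could perform: patch tokens are put back on their 2D grid, each token is mixed with its spatial neighbors, and the result is flattened into a sequence again. The abstract only states that LeFF promotes correlation among neighboring tokens, so several details here are assumptions made for illustration: the function name `locally_mix_tokens`, the 3x3 window, the zero padding, the single kernel shared across channels, and passing the class token through unchanged. The linear projections and activations of a full feed-forward layer are omitted.

```python
import math


def locally_mix_tokens(tokens, kernel):
    """Mix each patch token with its 3x3 spatial neighborhood.

    tokens: list of N + 1 vectors (lists of floats); tokens[0] is assumed
            to be the class token, the rest are patch tokens in row-major
            order on a square grid.
    kernel: 3x3 list of weights, shared across channels (an illustrative
            simplification, not a learned per-channel filter).
    """
    cls_token, patch_tokens = tokens[0], tokens[1:]
    side = math.isqrt(len(patch_tokens))
    if side * side != len(patch_tokens):
        raise ValueError("patch tokens must form a square grid")
    channels = len(cls_token)

    # Restore the 2D layout: grid[r][c] is the token at row r, column c.
    grid = [patch_tokens[r * side:(r + 1) * side] for r in range(side)]

    mixed = []
    for r in range(side):
        for c in range(side):
            out = [0.0] * channels
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Positions outside the grid contribute zero (zero padding).
                    if 0 <= rr < side and 0 <= cc < side:
                        weight = kernel[dr + 1][dc + 1]
                        for ch in range(channels):
                            out[ch] += weight * grid[rr][cc][ch]
            mixed.append(out)

    # Flatten back to a sequence; the class token is left untouched.
    return [list(cls_token)] + mixed


# Usage: a class token plus a 2x2 grid of one-channel patch tokens.
tokens = [[9.0], [0.0], [1.0], [2.0], [3.0]]
box_kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
mixed_tokens = locally_mix_tokens(tokens, box_kernel)
```

With the all-ones kernel, every patch token in this small grid becomes the sum of all four patch values, while the class token keeps its original value. A kernel with a single 1 at its center leaves the sequence unchanged, which shows that a plain token-wise layer is the special case with no neighbor mixing.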

ICCV 2021
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Image Classification | CIFAR-10 | CeiT-S | Percentage correct | 99 | #23 |
| Image Classification | CIFAR-10 | CeiT-S (384 finetune resolution) | Percentage correct | 99.1 | #15 |
| Image Classification | CIFAR-10 | CeiT-T | Percentage correct | 98.5 | #42 |
| Image Classification | CIFAR-100 | CeiT-S (384 finetune resolution) | Percentage correct | 91.8 | #17 |
| Image Classification | CIFAR-100 | CeiT-S | Percentage correct | 91.8 | #17 |
| Image Classification | CIFAR-100 | CeiT-T (384 finetune resolution) | Percentage correct | 88 | #41 |
| Image Classification | CIFAR-100 | CeiT-T | Percentage correct | 89.4 | #32 |
| Image Classification | Flowers-102 | CeiT-S (384 finetune resolution) | Accuracy | 98.6 | #21 |
| Image Classification | Flowers-102 | CeiT-T | Accuracy | 96.9 | #38 |
| Image Classification | Flowers-102 | CeiT-T (384 finetune resolution) | Accuracy | 97.8 | #34 |
| Image Classification | Flowers-102 | CeiT-S | Accuracy | 98.2 | #27 |
| Image Classification | ImageNet | CeiT-T | Top 1 Accuracy | 76.4% | #850 |
| Image Classification | ImageNet | CeiT-T | Number of params | 6.4M | #443 |
| Image Classification | ImageNet | CeiT-T | GFLOPs | 1.2 | #114 |
| Image Classification | ImageNet | CeiT-S (384 finetune res) | Top 1 Accuracy | 83.3% | #408 |
| Image Classification | ImageNet | CeiT-S (384 finetune res) | Number of params | 24.2M | #581 |
| Image Classification | ImageNet | CeiT-S (384 finetune res) | GFLOPs | 12.9 | #325 |
| Image Classification | ImageNet | CeiT-T (384 finetune res) | Top 1 Accuracy | 78.8% | #744 |
| Image Classification | ImageNet | CeiT-T (384 finetune res) | GFLOPs | 3.6 | #181 |
| Image Classification | ImageNet | CeiT-S | Top 1 Accuracy | 82% | #536 |
| Image Classification | ImageNet | CeiT-S | GFLOPs | 4.5 | #211 |
| Image Classification | ImageNet ReaL | CeiT-T | Accuracy | 83.6% | #47 |
| Image Classification | ImageNet ReaL | CeiT-S (384 finetune res) | Accuracy | 88.1% | #28 |
| Image Classification | ImageNet ReaL | CeiT-S | Accuracy | 87.3% | #34 |
| Image Classification | iNaturalist 2018 | CeiT-S (384 finetune resolution) | Top-1 Accuracy | 79.4% | #19 |
| Image Classification | iNaturalist 2018 | CeiT-S | Top-1 Accuracy | 73.3% | #32 |
| Image Classification | iNaturalist 2018 | CeiT-T (384 finetune resolution) | Top-1 Accuracy | 72.2% | #34 |
| Image Classification | iNaturalist 2018 | CeiT-T | Top-1 Accuracy | 64.3% | #50 |
| Image Classification | iNaturalist 2019 | CeiT-T | Top-1 Accuracy | 72.8 | #16 |
| Image Classification | iNaturalist 2019 | CeiT-T (384 finetune resolution) | Top-1 Accuracy | 77.9 | #13 |
| Image Classification | iNaturalist 2019 | CeiT-S (384 finetune resolution) | Top-1 Accuracy | 82.7 | #9 |
| Image Classification | iNaturalist 2019 | CeiT-S | Top-1 Accuracy | 78.9 | #12 |
| Image Classification | Oxford-IIIT Pets | CeiT-T (384 finetune resolution) | Accuracy | 94.5 | #5 |
| Image Classification | Oxford-IIIT Pets | CeiT-S | Accuracy | 94.6 | #4 |
| Image Classification | Oxford-IIIT Pets | CeiT-T | Accuracy | 93.8 | #6 |
| Image Classification | Oxford-IIIT Pets | CeiT-S (384 finetune resolution) | Accuracy | 94.9 | #2 |
| Image Classification | Stanford Cars | CeiT-S | Accuracy | 93.2 | #9 |
| Image Classification | Stanford Cars | CeiT-T (384 finetune resolution) | Accuracy | 93 | #11 |
| Image Classification | Stanford Cars | CeiT-T | Accuracy | 90.5 | #13 |
| Image Classification | Stanford Cars | CeiT-S (384 finetune resolution) | Accuracy | 94.1 | #6 |

Methods