When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations

ICLR 2022  ·  Xiangning Chen, Cho-Jui Hsieh, Boqing Gong

Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features and inductive biases with general-purpose neural architectures. Existing works empower these models with massive data, such as large-scale pre-training and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rates). Hence, this paper investigates ViTs and MLP-Mixers through the lens of loss geometry, aiming to improve the models' data efficiency at training and generalization at inference. Visualizations and Hessian spectra reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with simple Inception-style preprocessing). We show that the improved smoothness is attributable to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pre-training or strong data augmentations. Model checkpoints are available at https://github.com/google-research/vision_transformer.
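The sharpness-aware optimizer referenced in the abstract (SAM) first ascends to the locally worst-case weights within a small L2 ball, then applies the gradient computed at that perturbed point to the original weights. A minimal sketch on a scalar toy loss, assuming illustrative hyperparameter values (`lr`, `rho`) and a hypothetical `sam_step` helper, not the paper's actual implementation:

```python
def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sketched SAM update on a scalar parameter.

    Step 1: move to the worst-case point inside a ball of radius rho
            (for a scalar, the normalized gradient is just its sign).
    Step 2: descend using the gradient evaluated at that perturbed point.
    """
    g = grad_fn(w)
    eps = rho * g / (abs(g) + 1e-12)   # ascent direction, norm <= rho
    return w - lr * grad_fn(w + eps)   # update original w with perturbed grad

# Toy loss L(w) = (w - 3)^2, gradient 2 * (w - 3)
grad = lambda w: 2.0 * (w - 3.0)

w = 0.0
for _ in range(200):
    w = sam_step(w, grad)
# w settles near the minimum at 3, oscillating within roughly +/- rho
```

On a real network the perturbation is applied per-tensor (or globally) to all weights and requires a second forward/backward pass per step, which is the main cost SAM adds over vanilla SGD.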

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Image Classification | CIFAR-10 | ResNet-50-SAM | Percentage correct | 97.4 | #79 |
| Image Classification | CIFAR-10 | ResNet-50-SAM | Parameters | 25M | #7 |
| Image Classification | CIFAR-10 | Mixer-B/16-SAM | Percentage correct | 97.8 | #64 |
| Image Classification | CIFAR-10 | ViT-S/16-SAM | Percentage correct | 98.2 | #44 |
| Image Classification | CIFAR-10 | ResNet-152-SAM | Percentage correct | 98.2 | #44 |
| Image Classification | CIFAR-10 | ViT-B/16-SAM | Percentage correct | 98.6 | #30 |
| Image Classification | CIFAR-10 | ViT-B/16-SAM | Parameters | 87M | #3 |
| Image Classification | CIFAR-10 | Mixer-S/16-SAM | Percentage correct | 96.1 | #112 |
| Image Classification | CIFAR-100 | ViT-B/16-SAM | Percentage correct | 89.1 | #34 |
| Image Classification | CIFAR-100 | ViT-S/16-SAM | Percentage correct | 87.6 | #43 |
| Image Classification | CIFAR-100 | ResNet-50-SAM | Percentage correct | 85.2 | #67 |
| Image Classification | CIFAR-100 | Mixer-S/16-SAM | Percentage correct | 82.4 | #103 |
| Image Classification | CIFAR-100 | Mixer-B/16-SAM | Percentage correct | 86.4 | #54 |
| Image Classification | Flowers-102 | ResNet-50-SAM | Accuracy | 90 | #46 |
| Image Classification | Flowers-102 | ViT-B/16-SAM | Accuracy | 91.8 | #43 |
| Image Classification | Flowers-102 | ViT-S/16-SAM | Accuracy | 91.5 | #44 |
| Image Classification | Flowers-102 | Mixer-S/16-SAM | Accuracy | 87.9 | #48 |
| Image Classification | Flowers-102 | Mixer-B/16-SAM | Accuracy | 90 | #46 |
| Image Classification | Flowers-102 | ResNet-152-SAM | Accuracy | 91.1 | #45 |
| Image Classification | ImageNet | ResNet-152x2-SAM | Top 1 Accuracy | 81.1% | #607 |
| Image Classification | ImageNet | ResNet-152x2-SAM | Number of params | 236M | #906 |
| Image Classification | ImageNet | Mixer-B/8-SAM | Top 1 Accuracy | 79% | #727 |
| Image Classification | ImageNet | Mixer-B/8-SAM | Number of params | 64M | #773 |
| Image Classification | ImageNet | ViT-B/16-SAM | Top 1 Accuracy | 79.9% | #671 |
| Image Classification | ImageNet | ViT-B/16-SAM | Number of params | 87M | #822 |
| Domain Generalization | ImageNet-C | ResNet-152x2-SAM | Top 1 Accuracy | 55 | #8 |
| Domain Generalization | ImageNet-C | Mixer-B/8-SAM | Top 1 Accuracy | 48.9 | #10 |
| Domain Generalization | ImageNet-C | ViT-B/16-SAM | Top 1 Accuracy | 56.5 | #6 |
| Domain Generalization | ImageNet-R | Mixer-B/8-SAM | Top-1 Error Rate | 76.5 | #39 |
| Domain Generalization | ImageNet-R | ViT-B/16-SAM | Top-1 Error Rate | 73.6 | #38 |
| Domain Generalization | ImageNet-R | ResNet-152x2-SAM | Top-1 Error Rate | 71.9 | #37 |
| Image Classification | ImageNet ReaL | ResNet-152x2-SAM | Accuracy | 86.4% | #37 |
| Image Classification | ImageNet ReaL | Mixer-B/8-SAM | Accuracy | 84.4% | #45 |
| Image Classification | ImageNet ReaL | ViT-B/16-SAM | Accuracy | 85.2% | #43 |
| Image Classification | ImageNet V2 | Mixer-B/8-SAM | Top 1 Accuracy | 65.5 | #32 |
| Image Classification | ImageNet V2 | ResNet-152x2-SAM | Top 1 Accuracy | 69.6 | #26 |
| Image Classification | ImageNet V2 | ViT-B/16-SAM | Top 1 Accuracy | 67.5 | #29 |
| Fine-Grained Image Classification | Oxford-IIIT Pets | ResNet-152-SAM | Accuracy | 93.3 | #6 |
| Fine-Grained Image Classification | Oxford-IIIT Pets | ResNet-50-SAM | Accuracy | 91.6 | #10 |
| Fine-Grained Image Classification | Oxford-IIIT Pets | ViT-B/16-SAM | Accuracy | 93.1 | #7 |
| Fine-Grained Image Classification | Oxford-IIIT Pets | ViT-S/16-SAM | Accuracy | 92.9 | #8 |
| Fine-Grained Image Classification | Oxford-IIIT Pets | Mixer-S/16-SAM | Accuracy | 88.7 | #11 |
| Fine-Grained Image Classification | Oxford-IIIT Pets | Mixer-B/16-SAM | Accuracy | 92.5 | #9 |
