Discrete Representations Strengthen Vision Transformer Robustness

The Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs trained on ImageNet rely too heavily on local textures and fail to make adequate use of shape information; as a result, they have difficulty generalizing to out-of-distribution, real-world data. To address this deficiency, we present a simple and effective modification to ViT's input layer: we add discrete tokens produced by a vector-quantized encoder. Unlike the standard continuous pixel tokens, discrete tokens are invariant under small perturbations and individually carry less information, which encourages ViTs to learn global information that is invariant. Experimental results demonstrate that adding discrete representations to four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks while maintaining performance on ImageNet itself.
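The abstract describes the modification only at a high level, so the following PyTorch sketch shows one plausible way such an input layer could be wired up: each patch gets both the usual continuous pixel embedding and the embedding of a discrete code emitted by a pretrained, frozen vector-quantized encoder. This is a minimal illustration under stated assumptions, not the authors' implementation; the class name, the per-patch fusion by summation, the codebook size, and the expectation that the VQ encoder yields one code per ViT patch are all assumptions.

```python
import torch
import torch.nn as nn

class DiscreteTokenViTInput(nn.Module):
    """Sketch of a ViT input layer augmented with discrete tokens.

    Assumes a pretrained, frozen VQ encoder (e.g. a VQ-VAE/VQ-GAN)
    separately maps the image to one integer code per ViT patch.
    """

    def __init__(self, image_size=224, patch_size=16, embed_dim=768,
                 codebook_size=8192):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Standard continuous pixel-token path: patchify + linear projection,
        # implemented as a strided convolution.
        self.pixel_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size,
                                     stride=patch_size)
        # Embedding table for the discrete codes from the VQ encoder.
        self.code_embed = nn.Embedding(codebook_size, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, images, codes):
        # images: (B, 3, H, W); codes: (B, num_patches) integer indices
        # produced by the frozen VQ encoder for the same image.
        pixel_tokens = self.pixel_embed(images).flatten(2).transpose(1, 2)
        discrete_tokens = self.code_embed(codes)
        # Fuse the two token types per patch (summation here; concatenating
        # along the channel dimension would be another option).
        tokens = pixel_tokens + discrete_tokens
        cls = self.cls_token.expand(images.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed
```

Summation keeps the sequence length and embedding width identical to a vanilla ViT, so the rest of the transformer stack needs no changes; whichever fusion the paper actually uses, the key idea is that the discrete path is perturbation-invariant while the pixel path preserves fine detail.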

ICLR 2022
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Image Classification | ImageNet | DiscreteViT | Top-1 Accuracy | 85.07% | #254 |
| Domain Generalization | ImageNet-C | DrViT | mean Corruption Error (mCE) | 46.22 | #21 |
| Domain Generalization | ImageNet-C | DiscreteViT | mean Corruption Error (mCE) | 46.22 | #21 |
| Domain Generalization | ImageNet-C | DiscreteViT | Number of params | 87M | #35 |
| Domain Generalization | ImageNet-C | DiscreteViT (Im21k) | mean Corruption Error (mCE) | 38.74 | #11 |
| Domain Generalization | ImageNet-C | DiscreteViT (Im21k) | Number of params | 87M | #35 |
| Domain Generalization | ImageNet-R | DiscreteViT | Top-1 Error Rate | 44.74 | #20 |
| Domain Generalization | ImageNet-Sketch | DrViT | Top-1 Accuracy | 44.72 | #16 |
| Image Classification | ObjectNet | ViT-B (Discrete 512x512) | Top-1 Accuracy | 46.62 | #35 |
| Domain Generalization | Stylized-ImageNet | DiscreteViT | Top-1 Accuracy | 22.19 | #3 |

Methods