TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Classification	ImageNet	VOLO-D5+HAT	Top 1 Accuracy	87.3%	# 99
Image Classification	ImageNet	VOLO-D5+HAT	Number of params	295.5M	# 910
Image Classification	ImageNet	VOLO-D5+HAT	GFLOPs	412	# 481
Domain Generalization	ImageNet-C	VOLO-D5+HAT	mean Corruption Error (mCE)	38.4	# 10
Domain Generalization	ImageNet-C	VOLO-D5+HAT	Number of params	296M	# 38
Domain Generalization	ImageNet-R	VOLO-D5+HAT	Top-1 Error Rate	40.3	# 16
Domain Generalization	Stylized-ImageNet	VOLO-D5+HAT	Top 1 Accuracy	25.9	# 2

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/improving-vision-transformers-by-revisiting/domain-generalization-on-stylized-imagenet)](https://paperswithcode.com/sota/domain-generalization-on-stylized-imagenet?p=improving-vision-transformers-by-revisiting)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/improving-vision-transformers-by-revisiting/domain-generalization-on-imagenet-c)](https://paperswithcode.com/sota/domain-generalization-on-imagenet-c?p=improving-vision-transformers-by-revisiting)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/improving-vision-transformers-by-revisiting/domain-generalization-on-imagenet-r)](https://paperswithcode.com/sota/domain-generalization-on-imagenet-r?p=improving-vision-transformers-by-revisiting)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/improving-vision-transformers-by-revisiting/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=improving-vision-transformers-by-revisiting)`

Improving Vision Transformers by Revisiting High-frequency Components

3 Apr 2022 · Jiawang Bai, Li Yuan, Shu-Tao Xia, Shuicheng Yan, Zhifeng Li, Wei Liu

Transformer models have shown promising effectiveness in various vision tasks. However, compared with training Convolutional Neural Network (CNN) models, training Vision Transformer (ViT) models is more difficult and relies on large-scale training sets. To explain this observation, we hypothesize that ViT models are less effective than CNN models at capturing the high-frequency components of images, and we verify this with a frequency analysis. Inspired by this finding, we first investigate the effects of existing techniques for improving ViT models from a new frequency perspective, and find that the success of some techniques (e.g., RandAugment) can be attributed to better use of the high-frequency components. Then, to compensate for this weakness of ViT models, we propose HAT, which directly augments the high-frequency components of images via adversarial training. We show that HAT consistently boosts the performance of various ViT models (e.g., +1.2% for ViT-B, +0.5% for Swin-B), and in particular lifts the advanced model VOLO-D5 to 87.3% top-1 accuracy using only ImageNet-1K data. This advantage is also maintained on out-of-distribution data and transfers to downstream tasks. The code is available at: https://github.com/jiawangbai/HAT.
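To make "high-frequency components" concrete, the following is a minimal NumPy sketch that splits a grayscale image into low- and high-frequency parts with a circular mask in the Fourier domain. It is only an illustration of the kind of frequency decomposition the abstract refers to, not the authors' HAT method or their analysis code (see the linked repository for those). The function name `split_frequency_bands`, the `radius` cutoff, and the random test image are all assumptions made for this example.

```python
import numpy as np


def split_frequency_bands(image: np.ndarray, radius: float):
    """Split a 2-D grayscale image into low- and high-frequency parts.

    `radius` is a hypothetical cutoff, in frequency-index units, measured
    from the centre of the shifted spectrum.
    """
    height, width = image.shape
    # Move the zero-frequency (DC) term to the centre so that distance
    # from the centre corresponds to spatial frequency.
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Circular low-pass mask around the spectrum centre.
    rows = np.arange(height)[:, None] - height // 2
    cols = np.arange(width)[None, :] - width // 2
    low_pass = (rows ** 2 + cols ** 2) <= radius ** 2

    # Invert each band separately; the two parts sum back to the input.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_pass)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_pass)).real
    return low, high


# Illustrative input only: a random 32x32 "image".
rng = np.random.default_rng(0)
image = rng.random((32, 32))
low, high = split_frequency_bands(image, radius=4)
```

Because the DC term falls inside the low-pass mask, the high-frequency part has zero mean and carries only edges and fine texture, which is the band the abstract says ViT models capture less effectively.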

Code

jiawangbai/HAT official

Tasks

Domain Generalization

Image Classification

Vocal Bursts Intensity Prediction

Datasets

ImageNet

MS COCO

ImageNet-C

ImageNet-R

ImageNet-A

Results from the Paper

Ranked #2 on Domain Generalization on Stylized-ImageNet

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Vision Transformer
