Sliced Recursive Transformer

9 Nov 2021  ·  Zhiqiang Shen, Zechun Liu, Eric Xing ·

We present a neat yet effective recursive operation on vision transformers that improves parameter utilization without introducing additional parameters. This is achieved by sharing weights across the depth of the transformer network. The proposed method obtains a substantial gain (~2%) from the naive recursive operation alone, requires no specialized knowledge of network design principles, and introduces minimal computational overhead to the training procedure. To reduce the extra computation caused by the recursive operation while maintaining the superior accuracy, we propose an approximation based on multiple sliced group self-attentions across recursive layers, which reduces the computational cost by 10~30% with minimal performance loss. We call our model Sliced Recursive Transformer (SReT), a novel and parameter-efficient vision transformer design that is compatible with a broad range of other efficient ViT architectures. Our best model establishes a significant improvement on ImageNet-1K over state-of-the-art methods while containing fewer parameters. The proposed weight-sharing mechanism via the sliced recursion structure allows us to easily build a transformer with more than 100 or even 1,000 shared layers while keeping a compact size (13~15M parameters), avoiding the optimization difficulties that arise when models grow too large. This flexible scalability shows great potential for scaling up models and constructing extremely deep vision transformers. Code is available at https://github.com/szq0214/SReT.
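The two ideas in the abstract can be sketched compactly: a single transformer block whose weights are reused across depth (recursion), and self-attention computed within token slices rather than over the full sequence, which cuts the quadratic attention cost by the number of groups. The following is a minimal single-head NumPy sketch under my own simplifications, not the authors' implementation; all function names, the residual-only block, and the weight shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product attention over all tokens in x.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def sliced_group_attention(x, w_q, w_k, w_v, groups):
    # Split the N tokens into `groups` slices and attend within each slice
    # only, reducing attention cost from O(N^2) to roughly O(N^2 / groups).
    slices = np.array_split(x, groups, axis=0)
    return np.concatenate(
        [self_attention(s, w_q, w_k, w_v) for s in slices], axis=0
    )

def recursive_transformer(x, w_q, w_k, w_v, depth, groups):
    # ONE set of weights reused `depth` times (weight sharing across depth),
    # so the parameter count is independent of the recursion depth.
    for _ in range(depth):
        x = x + sliced_group_attention(x, w_q, w_k, w_v, groups)  # residual
    return x

rng = np.random.default_rng(0)
dim, tokens = 16, 8
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
x = rng.standard_normal((tokens, dim))
out = recursive_transformer(x, w_q, w_k, w_v, depth=4, groups=2)
print(out.shape)  # (8, 16)
```

With `groups=1` the sliced attention reduces to ordinary full self-attention, so the grouping is a pure accuracy/compute trade-off knob.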

Results from the Paper


Task: Image Classification on ImageNet

- SReT-B (384 res, ImageNet-1K only): Top-1 Accuracy 84.8% (rank #270); Params 71.2M (#788)
- SReT-S (512 res, ImageNet-1K only): Top-1 Accuracy 84.3% (#305); Params 21.3M (#550); GFLOPs 42.8 (#415)
- SReT-S (384 res, ImageNet-1K only): Top-1 Accuracy 83.8% (#358); Params 21M (#545); GFLOPs 18.5 (#359)
- SReT-T: Top-1 Accuracy 77.6% (#800); Params 4.8M (#394); GFLOPs 1.1 (#109)
- SReT-ExT: Top-1 Accuracy 74.0% (#909); Params 4M (#377); GFLOPs 0.7 (#83)
