Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding

26 May 2021  ·  Zizhao Zhang, Han Zhang, Long Zhao, Ting Chen, Sercan O. Arik, Tomas Pfister

Hierarchical structures are popular in recent vision transformers; however, they require sophisticated designs and massive datasets to work well. In this paper, we explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical way. We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication. This observation leads us to design a simplified architecture that requires minor code changes upon the original vision transformer. The benefits of the proposed judiciously-selected design are threefold: (1) NesT converges faster and requires much less training data to achieve good generalization on both ImageNet and small datasets like CIFAR; (2) when extending our key ideas to image generation, NesT leads to a strong decoder that is 8$\times$ faster than previous transformer-based generators; and (3) we show that decoupling the feature learning and abstraction processes via this nested hierarchy in our design enables constructing a novel method (named GradCAT) for visually interpreting the learned model. Source code is available at https://github.com/google-research/nested-transformer.
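To make the abstract's core idea concrete, below is a minimal sketch of the two ingredients it highlights: local self-attention applied independently inside non-overlapping image blocks, and a block aggregation step that merges neighbouring blocks so information can cross block boundaries at the next hierarchy level. The sketch is written in JAX; the single-head attention, the 2x2 max-pooling used for aggregation, and all shapes are illustrative assumptions rather than the authors' exact implementation, which is in the linked repository.

```python
# Hedged sketch of nested local attention + block aggregation.
# Not the paper's implementation; hyper-parameters are illustrative.
import jax
import jax.numpy as jnp


def local_self_attention(x):
    """Single-head self-attention applied independently to each block.

    x: (num_blocks, tokens_per_block, dim) -- attention never crosses blocks.
    """
    d = x.shape[-1]
    attn = jax.nn.softmax(jnp.einsum('bnd,bmd->bnm', x, x) / jnp.sqrt(d), axis=-1)
    return jnp.einsum('bnm,bmd->bnd', attn, x)


def blockify(feat, block):
    """Split a (H, W, D) feature map into non-overlapping block x block windows."""
    H, W, D = feat.shape
    feat = feat.reshape(H // block, block, W // block, block, D)
    feat = feat.transpose(0, 2, 1, 3, 4)           # (H/b, W/b, b, b, D)
    return feat.reshape(-1, block * block, D)      # (num_blocks, tokens, D)


def unblockify(blocks, H, W, block):
    """Inverse of blockify: back to a (H, W, D) feature map."""
    D = blocks.shape[-1]
    feat = blocks.reshape(H // block, W // block, block, block, D)
    feat = feat.transpose(0, 2, 1, 3, 4)
    return feat.reshape(H, W, D)


def aggregate(feat):
    """Block aggregation: 2x2 max-pool so each new block summarizes four old ones.

    This is the cross-block communication step; max-pooling here is a stand-in
    for the learned aggregation used in the paper.
    """
    H, W, D = feat.shape
    return feat.reshape(H // 2, 2, W // 2, 2, D).max(axis=(1, 3))


def nested_hierarchy(feat, block=4, levels=2):
    """Alternate per-block attention and block aggregation over a few levels."""
    for _ in range(levels):
        H, W, _ = feat.shape
        x = blockify(feat, block)
        x = local_self_attention(x)
        feat = aggregate(unblockify(x, H, W, block))
    return feat


if __name__ == '__main__':
    img_feat = jnp.ones((16, 16, 32))   # toy (H, W, D) feature map
    print(nested_hierarchy(img_feat).shape)   # (4, 4, 32)
```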


Results from the Paper


Task                  Dataset    Model                                  Metric               Value   Global Rank
Image Classification  CIFAR-10   Transformer local-attention (NesT-B)   Percentage correct   97.2    #84
Image Classification  CIFAR-10   Transformer local-attention (NesT-B)   Params               90.1M   #236
Image Classification  CIFAR-10   Transformer local-attention (NesT-B)   Top-1 Accuracy       97.2    #19
Image Classification  CIFAR-10   Transformer local-attention (NesT-B)   Parameters           90.1M   #2
Image Classification  CIFAR-100  Transformer local-attention (NesT-B)   Percentage correct   82.56   #101
Image Classification  ImageNet   Transformer local-attention (NesT-B)   Top-1 Accuracy       83.8%   #358
Image Classification  ImageNet   Transformer local-attention (NesT-B)   Number of params     68M     #785
Image Classification  ImageNet   Transformer local-attention (NesT-B)   GFLOPs               17.9    #357
Image Classification  ImageNet   Transformer local-attention (NesT-S)   Top-1 Accuracy       83.3%   #403
Image Classification  ImageNet   Transformer local-attention (NesT-S)   Number of params     38M     #663
Image Classification  ImageNet   Transformer local-attention (NesT-S)   GFLOPs               10.4    #301
Image Classification  ImageNet   Transformer local-attention (NesT-T)   Top-1 Accuracy       81.5%   #577
Image Classification  ImageNet   Transformer local-attention (NesT-T)   Number of params     17M     #521
Image Classification  ImageNet   Transformer local-attention (NesT-T)   GFLOPs               5.8     #239