TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Classification	CIFAR-10	TNT-B	Percentage correct	99.1	# 12
Image Classification	CIFAR-10	TNT-B	PARAMS	65.6M	# 233
Image Classification	CIFAR-100	TNT-B	Percentage correct	91.1	# 22
Image Classification	CIFAR-100	TNT-B	PARAMS	65.6M	# 197
Image Classification	ImageNet	TNT-B	Top 1 Accuracy	83.9%	# 347
Image Classification	ImageNet	TNT-B	Number of params	65.6M	# 775
Image Classification	ImageNet	TNT-B	Hardware Burden	None	# 1
Image Classification	ImageNet	TNT-B	Operations per network pass	None	# 1
Fine-Grained Image Classification	Oxford 102 Flowers	TNT-B	Accuracy	99.0%	# 9
Fine-Grained Image Classification	Oxford 102 Flowers	TNT-B	PARAMS	65.6M	# 25
Fine-Grained Image Classification	Oxford-IIIT Pet Dataset	TNT-B	Accuracy	95.0%	# 9
Fine-Grained Image Classification	Oxford-IIIT Pet Dataset	TNT-B	PARAMS	65.6M	# 17

Transformer in Transformer

NeurIPS 2021 · Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, Yunhe Wang

The transformer is a new kind of neural architecture that encodes input data as powerful features via the attention mechanism. Visual transformers first divide the input image into several local patches and then compute both the patch representations and the relationships among them. Since natural images are highly complex, with abundant detail and color information, this patch granularity is not fine enough to extract features of objects at different scales and locations. In this paper, we point out that attention inside these local patches is also essential for building high-performance visual transformers, and we explore a new architecture, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16$\times$16) as "visual sentences" and propose to further divide them into smaller patches (e.g., 4$\times$4) as "visual words". The attention of each word is calculated with the other words in the same visual sentence, at negligible computational cost. Features of both words and sentences are aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture; e.g., we achieve 81.5% top-1 accuracy on ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost. The PyTorch code is available at https://github.com/huawei-noah/CV-Backbones, and the MindSpore code is available at https://gitee.com/mindspore/models/tree/master/research/cv/TNT.
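
The sentence/word split described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (the official code is in the repositories linked above). The function names, the 224$\times$224 input size, the embedding sizes `d_word` and `d_sent`, the single attention head, and the simplified aggregation step (flatten the word features of a sentence and project them linearly) are all assumptions made for illustration.

```python
import numpy as np

def split_into_words(image, sentence=16, word=4):
    """Split an (H, W, C) image into visual sentences and visual words.

    Returns an array of shape
    (num_sentences, words_per_sentence, word * word * C).
    """
    h, w, c = image.shape
    assert h % sentence == 0 and w % sentence == 0
    assert sentence % word == 0
    gh, gw = h // sentence, w // sentence   # grid of sentences
    k = sentence // word                    # words per sentence side
    # Axes: (sentence row, word row, pixel row, sentence col, word col, pixel col, C)
    x = image.reshape(gh, k, word, gw, k, word, c)
    # Reorder to (sentence row, sentence col, word row, word col, pixel row, pixel col, C)
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    return x.reshape(gh * gw, k * k, word * word * c)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inner_attention(words, w_q, w_k, w_v):
    """Single-head scaled dot-product attention among the words of each sentence."""
    q, k, v = words @ w_q, words @ w_k, words @ w_v            # (n, m, d) each
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # (n, m, m)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))   # stand-in for an RGB input image

words = split_into_words(image)              # (196, 16, 48)

d_word, d_sent = 24, 384                     # hypothetical embedding sizes
w_embed = rng.standard_normal((words.shape[-1], d_word)) * 0.02
w_q, w_k, w_v = (rng.standard_normal((d_word, d_word)) * 0.02 for _ in range(3))
w_proj = rng.standard_normal((words.shape[1] * d_word, d_sent)) * 0.02

word_emb = words @ w_embed                                       # (196, 16, 24)
word_emb = word_emb + inner_attention(word_emb, w_q, w_k, w_v)   # residual connection
# Simplified aggregation (an assumption): flatten each sentence's words and project.
sentence_emb = word_emb.reshape(len(word_emb), -1) @ w_proj      # (196, 384)

print(words.shape, word_emb.shape, sentence_emb.shape)
```

Under these assumptions, each inner attention matrix is only 16$\times$16 (one per sentence), whereas attention across the 196 sentences would use a 196$\times$196 matrix. This gives a rough intuition for why the word-level attention adds little cost.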

Code

huawei-noah/CV-Backbones official

3,803

rwightman/pytorch-image-models

29,785

PaddlePaddle/PaddleClas

5,260

open-mmlab/mmclassification

3,160

Tasks

Fine-Grained Image Classification

Image Classification

Sentence

Datasets

CIFAR-10

ImageNet

MS COCO

CIFAR-100

Oxford 102 Flower

ADE20K

iNaturalist

Oxford-IIIT Pet Dataset

Results from the Paper

Ranked #9 on Fine-Grained Image Classification on Oxford-IIIT Pet Dataset

Methods

Absolute Position Encodings • Adam • Attention Dropout • BPE • DeiT • Dense Connections • Dropout • Feedforward Network • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • TNT • Transformer
