T2T-ViT (Tokens-To-Token Vision Transformer) is a Vision Transformer that incorporates 1) a layerwise Tokens-to-Token (T2T) transformation that progressively structurizes the image into tokens by recursively aggregating neighboring tokens into one token, so that the local structure represented by surrounding tokens can be modeled and the token length can be reduced; and 2) an efficient backbone with a deep-narrow structure for the vision transformer, motivated by CNN architecture design after empirical study.
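The token-aggregation idea can be sketched in a few lines. The following is a minimal PyTorch sketch, not the authors' implementation: it assumes a hypothetical `T2TStep` module that uses `nn.Unfold` to concatenate each k x k neighborhood of tokens into one longer token (so token length shrinks), with a standard `nn.TransformerEncoderLayer` standing in for the paper's token transformer; the class name, kernel size, stride, and padding are illustrative choices.

```python
import torch
import torch.nn as nn


class T2TStep(nn.Module):
    """One Tokens-to-Token step (minimal sketch): reshape the token sequence
    back into a 2D grid, concatenate each k x k neighborhood of tokens into a
    single longer token with nn.Unfold (reducing token length), then mix the
    aggregated tokens with a small transformer layer to model local structure."""

    def __init__(self, dim, kernel_size=3, stride=2, padding=1, num_heads=1):
        super().__init__()
        self.kernel_size, self.stride, self.padding = kernel_size, stride, padding
        self.unfold = nn.Unfold(kernel_size=kernel_size, stride=stride, padding=padding)
        new_dim = dim * kernel_size * kernel_size  # neighboring tokens are concatenated
        # assumption: a single standard encoder layer stands in for the token transformer
        self.mixer = nn.TransformerEncoderLayer(
            d_model=new_dim, nhead=num_heads, dim_feedforward=new_dim, batch_first=True
        )

    def forward(self, tokens, h, w):
        # tokens: (B, h*w, dim) -> rebuild the (B, dim, h, w) spatial grid
        b, n, c = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, c, h, w)
        # soft split: every sliding window of neighboring tokens becomes one token
        patches = self.unfold(grid)              # (B, dim*k*k, n_new)
        new_tokens = patches.transpose(1, 2)     # (B, n_new, dim*k*k)
        new_tokens = self.mixer(new_tokens)      # model local structure among neighbors
        new_h = (h + 2 * self.padding - self.kernel_size) // self.stride + 1
        new_w = (w + 2 * self.padding - self.kernel_size) // self.stride + 1
        return new_tokens, new_h, new_w


if __name__ == "__main__":
    # toy example: a 14x14 grid of 64-d tokens shrinks to a 7x7 grid of 576-d tokens
    step = T2TStep(dim=64)
    x = torch.randn(2, 14 * 14, 64)
    y, new_h, new_w = step(x, 14, 14)
    print(y.shape, new_h, new_w)  # torch.Size([2, 49, 576]) 7 7
```

Applying such a step recursively is what lets the token length fall before the deep-narrow backbone processes the sequence.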
Source: Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Task | Papers | Share |
---|---|---|
Image Classification | 3 | 15.79% |
Image Generation | 3 | 15.79% |
Language Modeling | 3 | 15.79% |
Language Modelling | 3 | 15.79% |
Model Compression | 2 | 10.53% |
Computational Efficiency | 1 | 5.26% |
Efficient ViTs | 1 | 5.26% |
Density Estimation | 1 | 5.26% |
Object Recognition | 1 | 5.26% |