T2T-ViT (Tokens-To-Token Vision Transformer) is a Vision Transformer that incorporates 1) a layerwise Tokens-to-Token (T2T) transformation that progressively structurizes the image into tokens by recursively aggregating neighboring tokens into one token, so that the local structure represented by surrounding tokens can be modeled and the token length can be reduced; and 2) an efficient backbone with a deep-narrow structure for the vision transformer, motivated by CNN architecture design after empirical study.
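The token-aggregation idea can be sketched in a few lines. The following is a minimal PyTorch sketch, not the authors' implementation: it assumes a hypothetical `T2TStep` module that uses `nn.Unfold` to concatenate each k x k neighborhood of tokens into one longer token (so token length shrinks), with a standard `nn.TransformerEncoderLayer` standing in for the paper's token transformer; the class name, kernel size, stride, and padding are illustrative choices.

```python
import torch
import torch.nn as nn


class T2TStep(nn.Module):
    """One Tokens-to-Token step (minimal sketch): reshape the token sequence
    back into a 2D grid, concatenate each k x k neighborhood of tokens into a
    single longer token with nn.Unfold (reducing token length), then mix the
    aggregated tokens with a small transformer layer to model local structure."""

    def __init__(self, dim, kernel_size=3, stride=2, padding=1, num_heads=1):
        super().__init__()
        self.kernel_size, self.stride, self.padding = kernel_size, stride, padding
        self.unfold = nn.Unfold(kernel_size=kernel_size, stride=stride, padding=padding)
        new_dim = dim * kernel_size * kernel_size  # neighboring tokens are concatenated
        # assumption: a single standard encoder layer stands in for the token transformer
        self.mixer = nn.TransformerEncoderLayer(
            d_model=new_dim, nhead=num_heads, dim_feedforward=new_dim, batch_first=True
        )

    def forward(self, tokens, h, w):
        # tokens: (B, h*w, dim) -> rebuild the (B, dim, h, w) spatial grid
        b, n, c = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, c, h, w)
        # soft split: every sliding window of neighboring tokens becomes one token
        patches = self.unfold(grid)              # (B, dim*k*k, n_new)
        new_tokens = patches.transpose(1, 2)     # (B, n_new, dim*k*k)
        new_tokens = self.mixer(new_tokens)      # model local structure among neighbors
        new_h = (h + 2 * self.padding - self.kernel_size) // self.stride + 1
        new_w = (w + 2 * self.padding - self.kernel_size) // self.stride + 1
        return new_tokens, new_h, new_w


if __name__ == "__main__":
    # toy example: a 14x14 grid of 64-d tokens shrinks to a 7x7 grid of 576-d tokens
    step = T2TStep(dim=64)
    x = torch.randn(2, 14 * 14, 64)
    y, new_h, new_w = step(x, 14, 14)
    print(y.shape, new_h, new_w)  # torch.Size([2, 49, 576]) 7 7
```

Applying such a step recursively is what lets the token length fall before the deep-narrow backbone processes the sequence.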
Source: Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Task | Papers | Share |
---|---|---|
Image Classification | 3 | 15.79% |
Image Generation | 3 | 15.79% |
Language Modeling | 3 | 15.79% |
Language Modelling | 3 | 15.79% |
Model Compression | 2 | 10.53% |
Computational Efficiency | 1 | 5.26% |
Efficient ViTs | 1 | 5.26% |
Density Estimation | 1 | 5.26% |
Object Recognition | 1 | 5.26% |