Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Transformers, which are popular for language modeling, have recently been explored for vision tasks, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a fixed-length sequence of tokens and then applies multiple Transformer layers to model their global relations for classification...
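As a rough illustration of the fixed-length tokenization the abstract describes, here is a minimal PyTorch sketch of ViT-style patch splitting. The function name `patchify` and the patch size are illustrative assumptions, not the authors' code:

```python
import torch

def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split a batch of images into a fixed-length sequence of flattened patches.

    images: (B, C, H, W), with H and W divisible by patch_size.
    returns: (B, N, C * patch_size**2), where N = (H // patch_size) * (W // patch_size).
    """
    B, C, H, W = images.shape
    # Extract non-overlapping patch_size x patch_size windows along H and W
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # (B, C, H/p, W/p, p, p) -> (B, N, C*p*p): one flattened token per patch
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size * patch_size)
    return patches

# Example: a 224x224 RGB image yields 196 tokens of dimension 768.
tokens = patchify(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The paper's critique is that this simple split ignores local structure such as edges and lines among neighboring pixels, which motivates its progressive Tokens-to-Token transformation.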


Datasets

ImageNet
Results from the Paper


Task: Image Classification on ImageNet

MODEL            TOP-1 ACCURACY   PARAMS (M)   GLOBAL RANK (ACCURACY / PARAMS)
T2T-ViT-14       81.5%            21.5         #157 / #212
T2T-ViTt-19      82.4%            39.2         #132 / #210
T2T-ViT-24       82.3%            64.4         #136 / #208
T2T-ViTt-24      82.6%            64.4         #122 / #208
T2T-ViT-19       81.9%            n/a          #149 / n/a
T2T-ViT-14|384   83.3%            n/a          #98 / n/a

Methods used in the Paper


METHOD                            TYPE
Vision Transformer                Image Models
Depthwise Convolution             Convolutions
Pointwise Convolution             Convolutions
Batch Normalization               Normalization
Depthwise Separable Convolution   Convolutions
ReLU                              Activation Functions
Average Pooling                   Pooling Operations
Sigmoid Activation                Activation Functions
Ghost Module                      Image Model Blocks
Squeeze-and-Excitation Block      Image Model Blocks
1x1 Convolution                   Convolutions
Ghost Bottleneck                  Skip Connection Blocks
Convolution                       Convolutions
Global Average Pooling            Pooling Operations
GhostNet                          Convolutional Neural Networks
GELU                              Activation Functions
Label Smoothing                   Regularization
Layer Normalization               Normalization
Residual Connection               Skip Connections
Dense Connections                 Feedforward Networks
Scaled Dot-Product Attention      Attention Mechanisms
BPE                               Subword Segmentation
Adam                              Stochastic Optimization
Dropout                           Regularization
Softmax                           Output Functions
Multi-Head Attention              Attention Modules
Transformer                       Transformers
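Several of the listed methods (Layer Normalization, Multi-Head Attention with Scaled Dot-Product Attention, Residual Connections, a GELU feedforward network, Dropout) compose into the standard Transformer encoder block that ViT-style models stack. Below is a minimal PyTorch sketch of such a block; the hyperparameters are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Standard pre-norm Transformer encoder block, as used in ViT-style models.

    Combines methods from the list above: Layer Normalization, Multi-Head
    (Scaled Dot-Product) Attention, Residual Connections, a GELU feedforward
    network, and Dropout. Dimensions here are illustrative only.
    """

    def __init__(self, dim: int = 384, num_heads: int = 6,
                 mlp_ratio: float = 4.0, drop: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=drop,
                                          batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Dropout(drop),
            nn.Linear(int(dim * mlp_ratio), dim),
            nn.Dropout(drop),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection around multi-head self-attention
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Residual connection around the GELU feedforward network
        x = x + self.mlp(self.norm2(x))
        return x

# Example: 196 tokens of dimension 384 pass through one block, shape unchanged.
x = torch.randn(1, 196, 384)
print(TransformerBlock()(x).shape)  # torch.Size([1, 196, 384])
```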