ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

Convolutional architectures have proven extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling...
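The abstract motivates trading hard convolutional inductive biases for soft ones; ConViT does this with gated positional self-attention (GPSA), where a learned per-head gate interpolates between convolution-like positional attention and standard content-based attention. Below is a minimal NumPy sketch of that gating idea, not the paper's exact parameterization: the `pos_scores` matrix and the scalar `gate` stand in for the paper's learned relative-position attention and per-head gating parameter.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_positional_self_attention(X, Wq, Wk, Wv, pos_scores, gate):
    """Sketch of GPSA for one head: blend content- and position-based attention.

    A sigmoid gate interpolates between a convolution-like positional
    attention map (soft inductive bias) and standard scaled dot-product
    attention, so the head can "escape locality" as training progresses.
    """
    d = Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    content = softmax(Q @ K.T / np.sqrt(d))   # standard dot-product attention
    positional = softmax(pos_scores)          # stand-in for learned positional scores
    g = 1.0 / (1.0 + np.exp(-gate))           # sigmoid gate in [0, 1]
    A = (1 - g) * content + g * positional    # soft convolutional bias
    return A @ V
```

At `gate` strongly negative the head behaves like a plain self-attention layer; strongly positive, it attends by relative position only, mimicking a convolution.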

Datasets

ImageNet

Results from the Paper


Task: Image Classification · Dataset: ImageNet

MODEL      | TOP 1 ACCURACY | GLOBAL RANK | PARAMS | GLOBAL RANK
ConViT-Ti  | 73.1%          | #323        | 6M     | #181
ConViT-Ti+ | 76.7%          | #276        | 10M    | #161
ConViT-S   | 81.3%          | #164        | 27M    | #122
ConViT-S+  | 82.2%          | #139        | 48M    | #89
ConViT-B   | 82.4%          | #132        | 86M    | #50
ConViT-B+  | 82.5%          | #127        | 152M   | #29

Methods used in the Paper


METHOD                       | TYPE
Dropout                      | Regularization
Multi-Head Attention         | Attention Modules
Dense Connections            | Feedforward Networks
Feedforward Network          | Feedforward Networks
Softmax                      | Output Functions
Scaled Dot-Product Attention | Attention Mechanisms
Attention Dropout            | Regularization
DeiT                         | Image Models
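Several of the listed methods compose into a single transformer building block: multi-head attention splits the embedding across heads, each head runs scaled dot-product attention with a softmax output function, and attention dropout regularizes the attention weights. A minimal NumPy sketch of that composition, with assumed shapes and a simplified dropout scheme (not any particular library's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, Wq, Wk, Wv, Wo, drop_p=0.1, train=False):
    """Multi-head scaled dot-product attention with optional attention dropout."""
    n, d = X.shape
    dh = d // num_heads
    # Project and split into heads: (num_heads, n, dh).
    Q = (X @ Wq).reshape(n, num_heads, dh).transpose(1, 0, 2)
    K = (X @ Wk).reshape(n, num_heads, dh).transpose(1, 0, 2)
    V = (X @ Wv).reshape(n, num_heads, dh).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(dh)  # scaled dot-product
    A = softmax(scores)                              # softmax output function
    if train and drop_p > 0:
        # Attention dropout: zero random weights, rescale the rest.
        mask = rng.random(A.shape) >= drop_p
        A = A * mask / (1 - drop_p)
    out = (A @ V).transpose(1, 0, 2).reshape(n, d)   # merge heads
    return out @ Wo                                  # output projection
```

In ConViT (as in DeiT), blocks like this alternate with feedforward networks joined by dense (residual) connections; ConViT additionally replaces early attention layers with the gated positional variant.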