Image Models

ConViT is a type of vision transformer that uses gated positional self-attention (GPSA), a form of positional self-attention that can be equipped with a “soft” convolutional inductive bias. The GPSA layers are initialized to mimic the locality of convolutional layers. Each attention head is then free to escape locality by adjusting a gating parameter that regulates how much attention is paid to position versus content information.
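The gating idea can be illustrated with a minimal NumPy sketch for a single attention head. It assumes the gating parameter is squashed by a sigmoid and used to blend a content-based attention map with a position-based one. The names `gated_attention`, `content_scores`, `position_scores`, and `gate` are hypothetical, the positional scores are a toy stand-in, and this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention(content_scores, position_scores, gate):
    """Blend content and positional attention for one head.

    content_scores, position_scores: (n_tokens, n_tokens) raw scores.
    gate: scalar gating parameter (assumed learnable in a real model).
    """
    g = sigmoid(gate)                        # share of attention given to position
    content_attn = softmax(content_scores)   # rows sum to 1
    position_attn = softmax(position_scores) # rows sum to 1
    return (1.0 - g) * content_attn + g * position_attn

# Toy inputs: random content scores, and positional scores that favor
# nearby tokens on a 1-D grid (an assumption made for illustration only).
rng = np.random.default_rng(0)
n_tokens = 4
content_scores = rng.normal(size=(n_tokens, n_tokens))
pos = np.arange(n_tokens)
position_scores = -((pos[:, None] - pos[None, :]) ** 2.0)

local_attn = gated_attention(content_scores, position_scores, gate=20.0)
free_attn = gated_attention(content_scores, position_scores, gate=-20.0)
```

With a large positive gate the head attends almost purely by position, which is the local, convolution-like regime. With a large negative gate it attends almost purely by content. Intermediate values mix the two, and because both maps are row-normalized, the blend is as well.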

Source: ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases




Task                  Papers  Share
Image Classification  1       100.00%