Convolutional Vision Transformer

Introduced by Wu et al. in CvT: Introducing Convolutions to Vision Transformers

The Convolutional vision Transformer (CvT) is an architecture that incorporates convolutions into the Transformer. The CvT design introduces convolutions into two core parts of the ViT architecture.

First, the Transformers are partitioned into multiple stages that form a hierarchical structure. Each stage begins with a convolutional token embedding, which applies an overlapping, strided convolution to a 2D-reshaped token map (the flattened token sequence reshaped back onto its spatial grid), followed by layer normalization. This lets the model capture local information while progressively shortening the token sequence and increasing the dimension of the token features from stage to stage, mirroring the spatial downsampling and channel growth of CNNs.
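As a rough illustration, here is a minimal PyTorch sketch of such a convolutional token embedding. The hyperparameters (a 7×7 kernel with stride 4, embedding dimension 64) follow the commonly reported first-stage configuration of CvT, but the class name and interface are illustrative rather than taken from any official implementation.

```python
import torch
from torch import nn

class ConvTokenEmbedding(nn.Module):
    # Overlapping strided convolution over a 2D map, then LayerNorm on tokens.
    # Hypothetical sketch; defaults mirror a typical CvT first stage.
    def __init__(self, in_channels=3, embed_dim=64, kernel_size=7, stride=4):
        super().__init__()
        padding = kernel_size // 2  # padding > 0 makes neighboring patches overlap
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=kernel_size, stride=stride,
                              padding=padding)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, embed_dim, H', W')
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)     # (B, H'*W', embed_dim) token sequence
        x = self.norm(x)
        return x, (H, W)                     # keep spatial size for later reshaping

# Example: a 224x224 image becomes 56x56 = 3136 tokens of dimension 64.
tokens, (h, w) = ConvTokenEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 64])
```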

Second, the linear projection before each self-attention block is replaced with a proposed convolutional projection, which applies an s × s depth-wise separable convolution to a 2D-reshaped token map. This allows the model to further capture local spatial context and reduce semantic ambiguity in the attention mechanism. It also permits managing computational complexity: the stride of the convolution can be used to subsample the key and value matrices, improving efficiency by 4× or more with minimal degradation in performance.
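A minimal sketch of this idea, again in PyTorch with illustrative names: the projection is modeled as a depth-wise spatial convolution followed by a point-wise 1×1 convolution (with a BatchNorm in between, one common way to build a depth-wise separable projection), and a stride of 2 on the key/value branches subsamples their token maps by 4×. The exact layer ordering and normalization are assumptions, not a transcription of the paper's code.

```python
import torch
from torch import nn

def conv_projection(dim, kernel_size=3, stride=1):
    # Depth-wise separable convolution: per-channel spatial conv,
    # normalization, then a 1x1 conv that mixes channels.
    return nn.Sequential(
        nn.Conv2d(dim, dim, kernel_size, stride=stride,
                  padding=kernel_size // 2, groups=dim),  # depth-wise
        nn.BatchNorm2d(dim),
        nn.Conv2d(dim, dim, kernel_size=1),               # point-wise
    )

class ConvProjectionQKV(nn.Module):
    # Q keeps full resolution; K and V are subsampled by kv_stride,
    # cutting attention cost roughly by kv_stride**2.
    def __init__(self, dim, kv_stride=2):
        super().__init__()
        self.q = conv_projection(dim, stride=1)
        self.k = conv_projection(dim, stride=kv_stride)
        self.v = conv_projection(dim, stride=kv_stride)

    def forward(self, tokens, hw):           # tokens: (B, N, C), hw: (H, W)
        B, N, C = tokens.shape
        H, W = hw
        x = tokens.transpose(1, 2).reshape(B, C, H, W)  # tokens -> 2D map
        seq = lambda t: t.flatten(2).transpose(1, 2)    # 2D map -> tokens
        return seq(self.q(x)), seq(self.k(x)), seq(self.v(x))

q, k, v = ConvProjectionQKV(64)(torch.randn(1, 3136, 64), (56, 56))
print(q.shape, k.shape)  # (1, 3136, 64) (1, 784, 64): 4x fewer key tokens
```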

Source: CvT: Introducing Convolutions to Vision Transformers

Tasks


Task                   Papers  Share
Image Classification   3       42.86%
Semantic Segmentation  2       28.57%
Instance Segmentation  1       14.29%
Continual Learning     1       14.29%
