Vision Transformers

PVT, or Pyramid Vision Transformer, is a vision transformer that uses a pyramid structure, making it an effective backbone for dense prediction tasks. Specifically, it operates on fine-grained inputs (4 x 4 pixels per patch) while progressively shrinking the Transformer's sequence length as the network deepens, which reduces the computational cost. In addition, a spatial-reduction attention (SRA) layer further cuts the resource consumption of learning high-resolution features.
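The idea behind SRA can be sketched as follows: queries attend at full resolution, but keys and values are computed on a spatially downsampled grid, shrinking the attention matrix by a factor of $R^2$. This is a minimal single-head sketch in NumPy; the average pooling stands in for the learned strided projection used in the actual model, and the function name and signature are illustrative, not from the paper.

```python
import numpy as np

def sra_attention(x, H, W, R):
    """Single-head sketch of spatial-reduction attention (SRA).

    x: (N, C) sequence of N = H*W patch tokens with C channels.
    R: spatial reduction ratio; keys/values are computed on a grid
       downsampled by R in each dimension (average pooling here,
       standing in for the learned strided projection).
    """
    N, C = x.shape
    # Queries use the full-resolution token sequence.
    q = x
    # Reduce the key/value sequence: (H, W, C) -> (H/R, W/R, C).
    grid = x.reshape(H, W, C)
    pooled = grid.reshape(H // R, R, W // R, R, C).mean(axis=(1, 3))
    kv = pooled.reshape(-1, C)          # length N / R^2
    # Standard scaled dot-product attention against the reduced set.
    scores = q @ kv.T / np.sqrt(C)      # (N, N/R^2) instead of (N, N)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv                 # (N, C)
```

With a 56 x 56 token grid and R = 8, the attention matrix is 3136 x 49 rather than 3136 x 3136, which is what makes high-resolution early stages affordable.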

The entire model is divided into four stages, each comprising a patch embedding layer and an $\mathcal{L}_{i}$-layer Transformer encoder. Following the pyramid structure, the output resolution of the four stages progressively shrinks from high (4-stride) to low (32-stride).
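The stage geometry follows directly from the four strides. A small sketch (the helper name and the 224-pixel input are illustrative assumptions) computes the feature-map side length and token count per stage:

```python
def pvt_stage_shapes(img_size=224):
    """Per-stage output resolution and token count for a square input,
    using the four strides (4, 8, 16, 32) of the pyramid structure."""
    shapes = []
    for stride in (4, 8, 16, 32):
        side = img_size // stride          # feature-map side length
        shapes.append((side, side, side * side))  # (H, W, tokens)
    return shapes

# For a 224 x 224 input the sequence length shrinks
# from 56*56 = 3136 tokens down to 7*7 = 49 tokens.
print(pvt_stage_shapes())
```

This shows why the fine-grained 4 x 4 patches are tractable: only the first stage sees the long 3136-token sequence, and each subsequent stage quarters the token count.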

Source: Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
