Vision Transformers

Dense Prediction Transformer

Introduced by Ranftl et al. in Vision Transformers for Dense Prediction

The Dense Prediction Transformer (DPT) is a vision transformer architecture for dense prediction tasks such as monocular depth estimation and semantic segmentation.

The input image is transformed into tokens (orange) either by extracting non-overlapping patches followed by a linear projection of their flattened representations (DPT-Base and DPT-Large), or by applying a ResNet-50 feature extractor (DPT-Hybrid). The image embedding is augmented with a positional embedding, and a patch-independent readout token (red) is added. The tokens are passed through multiple transformer stages, and tokens from different stages are reassembled into image-like representations at multiple resolutions (green). Fusion modules (purple) progressively fuse and upsample these representations to generate a fine-grained prediction.
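The tokenization and reassembly steps above can be sketched in a few lines. The following is a minimal, illustrative NumPy version (not the authors' implementation): it splits an image into non-overlapping patches, linearly projects and position-embeds them, prepends a readout token, and later reshapes the patch tokens back into an image-like feature map. The random projection weights and the function names are assumptions for illustration only; in the real model these are learned parameters inside transformer stages.

```python
import numpy as np

def image_to_tokens(image, patch_size, embed_dim, rng):
    # image: (H, W, C). Extract non-overlapping patches, flatten each,
    # and apply a linear projection (random weights stand in for learned ones).
    H, W, C = image.shape
    ph, pw = H // patch_size, W // patch_size
    patches = image[:ph * patch_size, :pw * patch_size].reshape(
        ph, patch_size, pw, patch_size, C).transpose(0, 2, 1, 3, 4)
    flat = patches.reshape(ph * pw, patch_size * patch_size * C)
    W_proj = rng.standard_normal((flat.shape[1], embed_dim)) * 0.02
    tokens = flat @ W_proj                            # (N, D) patch tokens
    tokens = tokens + rng.standard_normal(tokens.shape) * 0.02  # positional embedding
    readout = rng.standard_normal((1, embed_dim)) * 0.02
    return np.concatenate([readout, tokens], axis=0)  # prepend readout token

def reassemble(tokens, grid_hw):
    # Drop the readout token and reshape the remaining patch tokens
    # into an image-like representation for the fusion modules.
    h, w = grid_hw
    return tokens[1:].reshape(h, w, -1)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
toks = image_to_tokens(img, patch_size=16, embed_dim=768, rng=rng)
print(toks.shape)   # (197, 768): 14*14 patch tokens + 1 readout token
fmap = reassemble(toks, (14, 14))
print(fmap.shape)   # (14, 14, 768)
```

In the full architecture this reassembly happens at several transformer stages, each producing a feature map at a different resolution before fusion and upsampling.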

Source: Vision Transformers for Dense Prediction
