Vision Transformers

Dense Prediction Transformer

Introduced by Ranftl et al. in Vision Transformers for Dense Prediction

The Dense Prediction Transformer (DPT) is a vision transformer architecture for dense prediction tasks such as monocular depth estimation and semantic segmentation.

The input image is transformed into tokens (orange) either by extracting non-overlapping patches followed by a linear projection of their flattened representations (DPT-Base and DPT-Large), or by applying a ResNet-50 feature extractor (DPT-Hybrid). The image embedding is augmented with a positional embedding, and a patch-independent readout token (red) is added. The tokens are passed through multiple transformer stages, and tokens from different stages are reassembled into image-like representations at multiple resolutions (green). Fusion modules (purple) progressively fuse and upsample these representations to generate a fine-grained prediction.
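The tokenization and reassembly steps above can be sketched in a few lines. The following is a minimal, illustrative NumPy version (not the authors' implementation): it splits an image into non-overlapping patches, linearly projects and position-embeds them, prepends a readout token, and later reshapes the patch tokens back into an image-like feature map. The random projection weights and the function names are assumptions for illustration only; in the real model these are learned parameters inside transformer stages.

```python
import numpy as np

def image_to_tokens(image, patch_size, embed_dim, rng):
    # image: (H, W, C). Extract non-overlapping patches, flatten each,
    # and apply a linear projection (random weights stand in for learned ones).
    H, W, C = image.shape
    ph, pw = H // patch_size, W // patch_size
    patches = image[:ph * patch_size, :pw * patch_size].reshape(
        ph, patch_size, pw, patch_size, C).transpose(0, 2, 1, 3, 4)
    flat = patches.reshape(ph * pw, patch_size * patch_size * C)
    W_proj = rng.standard_normal((flat.shape[1], embed_dim)) * 0.02
    tokens = flat @ W_proj                            # (N, D) patch tokens
    tokens = tokens + rng.standard_normal(tokens.shape) * 0.02  # positional embedding
    readout = rng.standard_normal((1, embed_dim)) * 0.02
    return np.concatenate([readout, tokens], axis=0)  # prepend readout token

def reassemble(tokens, grid_hw):
    # Drop the readout token and reshape the remaining patch tokens
    # into an image-like representation for the fusion modules.
    h, w = grid_hw
    return tokens[1:].reshape(h, w, -1)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
toks = image_to_tokens(img, patch_size=16, embed_dim=768, rng=rng)
print(toks.shape)   # (197, 768): 14*14 patch tokens + 1 readout token
fmap = reassemble(toks, (14, 14))
print(fmap.shape)   # (14, 14, 768)
```

In the full architecture this reassembly happens at several transformer stages, each producing a feature map at a different resolution before fusion and upsampling.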

Source: Vision Transformers for Dense Prediction
