Segmentation Transformer, or SETR, is a Transformer-based segmentation model. The transformer-alone encoder treats an input image as a sequence of image patches represented by learned patch embeddings, and transforms the sequence with global self-attention modeling for discriminative feature representation learning. Concretely, we first decompose an image into a grid of fixed-size patches, forming a sequence of patches. With a linear embedding layer applied to the flattened pixel vectors of each patch, we obtain a sequence of feature embedding vectors as the input to a transformer. Given the learned features from the encoder transformer, a decoder is then used to recover the original image resolution. Crucially, there is no downsampling in spatial resolution, but global context modeling at every layer of the encoder transformer.
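The sketch below illustrates this pipeline in PyTorch: patch decomposition, linear patch embedding, a transformer-only encoder that keeps spatial resolution fixed, and a simple decoder that projects to class logits and bilinearly upsamples back to the input size (roughly in the spirit of the naive decoder variant). The class name, default hyperparameters, and decoder are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SETRSketch(nn.Module):
    """Minimal SETR-style model: hypothetical sketch, not the official code."""
    def __init__(self, img_size=512, patch_size=16, in_chans=3,
                 embed_dim=768, depth=12, num_heads=12, num_classes=19):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        # Linear embedding applied to the flattened pixel vector of each patch.
        self.patch_embed = nn.Linear(patch_size * patch_size * in_chans, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Naive decoder: 1x1 conv to class logits, then bilinear upsampling.
        self.head = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, x):
        B, C, H, W = x.shape
        p = self.patch_size
        # Decompose the image into a grid of fixed-size patches -> sequence.
        patches = x.unfold(2, p, p).unfold(3, p, p)            # B, C, H/p, W/p, p, p
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        tokens = self.patch_embed(patches) + self.pos_embed    # B, N, D
        # Global self-attention at every layer; no spatial downsampling.
        tokens = self.encoder(tokens)
        # Reshape the sequence back into a 2D feature map at patch resolution.
        feat = tokens.transpose(1, 2).reshape(B, -1, H // p, W // p)
        logits = self.head(feat)
        # Recover the original image resolution.
        return F.interpolate(logits, size=(H, W), mode="bilinear",
                             align_corners=False)

model = SETRSketch()
out = model(torch.randn(1, 3, 512, 512))
print(out.shape)  # torch.Size([1, 19, 512, 512])
```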
Source: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Task | Papers | Share |
---|---|---|
Semantic Segmentation | 2 | 33.33% |
Object Tracking | 1 | 16.67% |
Zero Shot Segmentation | 1 | 16.67% |
Decoder | 1 | 16.67% |
Medical Image Segmentation | 1 | 16.67% |