Vision Transformers

nnFormer, or not-another transFormer, is a semantic segmentation model with an interleaved architecture based on empirical combination of self-attention and convolution. Firstly, a light-weight convolutional embedding layer ahead is used ahead of transformer blocks. In comparison to directly flattening raw pixels and applying 1D pre-processing, the convolutional embedding layer encodes precise (i.e., pixel-level) spatial information and provide low-level yet high-resolution 3D features. After the embedding block, transformer and convolutional down-sampling blocks are interleaved to fully entangle long-term dependencies with high-level and hierarchical object concepts at various scales, which helps improve the generalization ability and robustness of learned representations.

Source: nnFormer: Interleaved Transformer for Volumetric Segmentation


Paper Code Results Date Stars


Task Papers Share
Medical Image Segmentation 1 100.00%