Vision Transformers

NesT stacks canonical transformer layers to conduct local self-attention on every image block independently, and then "nests" them hierarchically. Coupling of processed information between spatially adjacent blocks is achieved through a proposed block aggregation between every two hierarchies. The overall hierarchical structure is determined by two key hyperparameters: patch size $S \times S$ and the number of block hierarchies $T_d$. All blocks within a hierarchy share one set of parameters. Given an input image, each patch is linearly projected to an embedding; all embeddings are partitioned into blocks and flattened to form the final input. Each transformer layer consists of a multi-head self-attention (MSA) layer followed by a feed-forward fully-connected network (FFN), with skip connections and layer normalization. Positional embeddings are added to encode spatial information before the input is fed into the block. Finally, a nested hierarchy is built with block aggregation: every four spatially connected blocks are merged into one.
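The shape flow above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the transformer is a placeholder identity, and block aggregation is approximated by 2x2 mean pooling (the paper uses a convolution, layer normalization, and 3x3 max pooling); block size `b` and hierarchy count `T_d` are chosen arbitrarily for the example.

```python
import numpy as np

def partition(x, b):
    """Split an (H, W, C) feature map into non-overlapping b x b blocks.
    Returns (num_blocks, b*b, C) -- each block is an independent sequence."""
    H, W, C = x.shape
    x = x.reshape(H // b, b, W // b, b, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, b * b, C)

def unpartition(blocks, H, W, b):
    """Inverse of partition: (num_blocks, b*b, C) -> (H, W, C)."""
    C = blocks.shape[-1]
    x = blocks.reshape(H // b, W // b, b, b, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def local_transformer(blocks):
    # Placeholder for the shared MSA + FFN layers applied to each block
    # independently; identity here to keep the sketch self-contained.
    return blocks

def aggregate(x):
    # Stand-in for block aggregation: 2x2 spatial downsampling by mean,
    # so every four spatially adjacent blocks merge into one.
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

H = W = 16   # patch-embedding grid (after the S x S patchify + projection)
C = 8        # embedding dimension (example value)
b = 4        # block side length, so each block holds b*b = 16 embeddings
T_d = 3      # number of block hierarchies

x = np.random.randn(H, W, C)
for t in range(T_d):
    blocks = partition(x, b)            # local self-attention per block
    blocks = local_transformer(blocks)  # parameters shared within a hierarchy
    x = unpartition(blocks, x.shape[0], x.shape[1], b)
    if t < T_d - 1:
        x = aggregate(x)                # 16 blocks -> 4 blocks -> 1 block
print(x.shape)
```

With these example sizes the block count shrinks from 16 to 4 to 1 across the three hierarchies, matching the four-into-one merging rule, and the final map is a single 4x4 block.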

Source: Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding


Task Papers Share
BIG-bench Machine Learning 1 14.29%
Anatomy 1 14.29%
Audio Denoising 1 14.29%
Denoising 1 14.29%
Speech Recognition 1 14.29%
Image Classification 1 14.29%
Image Generation 1 14.29%
