Convolution-enhanced image Transformer

Introduced by Yuan et al. in Incorporating Convolution Designs into Visual Transformers

Convolution-enhanced image Transformer (CeiT) combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies. Three modifications are made to the original Transformer: 1) instead of straightforward tokenization from raw input images, an Image-to-Tokens (I2T) module extracts patches from generated low-level features; 2) the feed-forward network in each encoder block is replaced with a Locally-enhanced Feed-Forward (LeFF) layer that promotes correlation among neighbouring tokens in the spatial dimension; 3) a Layer-wise Class token Attention (LCA) is attached at the top of the Transformer to utilize the multi-level representations.

Source: Incorporating Convolution Designs into Visual Transformers
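
The first two modifications can be made concrete with a short PyTorch sketch. Below is a minimal, illustrative implementation of the I2T tokenizer and the LeFF block; the hyperparameters (stem channels, kernel sizes, patch size, expansion ratio) are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch of CeiT's I2T and LeFF modules, assuming illustrative
# hyperparameters rather than the paper's exact configuration.
import torch
import torch.nn as nn


class ImageToTokens(nn.Module):
    """I2T: tokenize convolutional low-level features instead of raw pixels."""

    def __init__(self, in_chans=3, feat_chans=32, embed_dim=192, patch_size=4):
        super().__init__()
        # Conv + BN + MaxPool produce low-level feature maps
        # (two stride-2 stages, so the map is 1/4 of the input resolution).
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, feat_chans, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(feat_chans),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Patch embedding on the feature map with a small patch size.
        self.proj = nn.Conv2d(feat_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(self.stem(x))          # (B, D, H', W')
        return x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence


class LeFF(nn.Module):
    """Locally-enhanced Feed-Forward: a depth-wise 3x3 convolution between
    the two linear layers correlates neighbouring tokens spatially.
    The class token bypasses the block unchanged."""

    def __init__(self, dim=192, expand_ratio=4):
        super().__init__()
        hidden = dim * expand_ratio
        self.expand = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())
        self.dwconv = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),
            nn.GELU(),
        )
        self.reduce = nn.Linear(hidden, dim)

    def forward(self, x):                    # x: (B, 1 + N, D), N = h * h
        cls_tok, patches = x[:, :1], x[:, 1:]
        b, n, _ = patches.shape
        h = int(n ** 0.5)                    # restore 2D spatial layout
        t = self.expand(patches)             # (B, N, hidden)
        t = t.transpose(1, 2).reshape(b, -1, h, h)
        t = self.dwconv(t)                   # local token interaction
        t = t.flatten(2).transpose(1, 2)     # back to (B, N, hidden)
        t = self.reduce(t)
        return torch.cat([cls_tok, t], dim=1)
```

In a CeiT encoder block, LeFF simply replaces the standard MLP between attention layers; because the convolution is depth-wise, the local spatial mixing comes at a small parameter and compute cost.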
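
The LCA head can be sketched similarly: the class tokens gathered from every encoder layer form a short sequence that a single attention-plus-feed-forward block aggregates. The paper computes attention only with the last layer's class token as the query to save compute; the full self-attention below is a simplification, and the block internals (heads, MLP ratio) are again illustrative assumptions.

```python
# Hedged sketch of Layer-wise Class token Attention (LCA).
import torch.nn as nn


class LCA(nn.Module):
    """Aggregate the class tokens collected from all encoder layers."""

    def __init__(self, dim=192, num_heads=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, cls_tokens):           # (B, L, D): one token per layer
        x = self.norm1(cls_tokens)
        x = cls_tokens + self.attn(x, x, x, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x[:, -1]                      # final aggregated class token
```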

Tasks

Task                                 Papers   Share
Image Classification                 2        50.00%
Human-Object Interaction Detection   1        25.00%
Object Detection                     1        25.00%

Categories

Vision Transformers