Vision Transformers

Focal self-attention is designed to make Transformer layers scalable to high-resolution inputs. Instead of attending to all tokens at fine granularity, the approach attends to fine-grained tokens only locally and to summarized (coarse-grained) tokens globally. As a result, it can cover as many regions as standard self-attention but at a much lower cost. An image is first partitioned into patches, producing visual tokens. A patch embedding layer, consisting of a convolutional layer whose filter size and stride equal the patch size, then projects the patches into hidden features. This spatial feature map is passed through four stages of focal Transformer blocks, where each block consists of $N_i$ focal Transformer layers. Patch embedding layers are used between stages to reduce the spatial size of the feature map by a factor of 2 while doubling the feature dimension.

Source: Focal Self-attention for Local-Global Interactions in Vision Transformers
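The sketch below illustrates the hierarchical layout described above: a patch embedding implemented as a convolution with equal kernel size and stride, four stages of Transformer layers, and between-stage patch embeddings that halve the spatial resolution and double the channel dimension. It is a minimal PyTorch illustration, not the authors' implementation; names such as `PatchEmbed` and `FocalStyleBackbone` are assumptions, and the focal self-attention layers themselves are stubbed with standard Transformer encoder layers.

```python
# Minimal sketch (assumed names, standard attention as a stand-in for focal attention).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Projects non-overlapping patches with a conv whose kernel size equals its stride."""
    def __init__(self, in_ch, embed_dim, patch_size):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, C, H, W)
        return self.proj(x)                     # (B, embed_dim, H/patch, W/patch)

class FocalStyleBackbone(nn.Module):
    """Four stages; between stages a patch embedding halves spatial size and doubles channels."""
    def __init__(self, dims=(96, 192, 384, 768), depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24)):
        super().__init__()
        self.stem = PatchEmbed(3, dims[0], patch_size=4)          # initial 4x4 patches
        self.stages, self.downsamples = nn.ModuleList(), nn.ModuleList()
        for i, (dim, depth, heads) in enumerate(zip(dims, depths, num_heads)):
            # Placeholder for the N_i focal Transformer layers of stage i.
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.stages.append(nn.TransformerEncoder(layer, num_layers=depth))
            if i < len(dims) - 1:
                # 2x2 patch embedding between stages: spatial /2, channels x2.
                self.downsamples.append(PatchEmbed(dim, dims[i + 1], patch_size=2))

    def forward(self, x):
        x = self.stem(x)
        for i, stage in enumerate(self.stages):
            B, C, H, W = x.shape
            tokens = x.flatten(2).transpose(1, 2)                 # (B, H*W, C)
            tokens = stage(tokens)
            x = tokens.transpose(1, 2).reshape(B, C, H, W)
            if i < len(self.downsamples):
                x = self.downsamples[i](x)
        return x

feats = FocalStyleBackbone()(torch.randn(1, 3, 224, 224))
print(feats.shape)   # torch.Size([1, 768, 7, 7])
```

With a 224x224 input and a 4x4 stem, the stages operate at 56x56, 28x28, 14x14, and 7x7 resolution, matching the "spatial size /2, feature dimension x2" progression noted above.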
