Vision Transformers

LocalViT

Introduced by Li et al. in LocalViT: Bringing Locality to Vision Transformers

LocalViT introduces depthwise convolutions to enhance the local feature modeling capability of ViTs. The network, as shown in Figure (c), brings a locality mechanism into transformers through depth-wise convolution (denoted by "DW"). To accommodate the convolution operation, conversion between the token sequence and the image feature map is added via "Seq2Img" and "Img2Seq". The computation is as follows:

$$ \mathbf{Y}^{r}=f\left(f\left(\mathbf{Z}^{r} \circledast \mathbf{W}_{1}^{r} \right) \circledast \mathbf{W}_d \right) \circledast \mathbf{W}_2^{r} $$

where $\mathbf{W}_{d} \in \mathbb{R}^{\gamma d \times 1 \times k \times k}$ is the kernel of the depth-wise convolution, $\gamma$ is the expansion ratio of the hidden layer, and $k$ is the kernel size.

The input (a sequence of tokens) is first reshaped to a feature map rearranged on a 2D lattice. Two pointwise convolutions ($\mathbf{W}_1^r$ and $\mathbf{W}_2^r$), with a depth-wise convolution ($\mathbf{W}_d$) in between, are applied to the feature map. The result is reshaped back to a sequence of tokens, which serves as input to the self-attention of the next transformer layer.
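The pipeline above can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the function names (`seq2img`, `img2seq`, `localvit_ffn`), the ReLU choice for $f$, and the toy sizes are all assumptions made for clarity.

```python
import numpy as np

def seq2img(tokens, h, w):
    # "Seq2Img": (N, d) token sequence -> (d, h, w) feature map on a 2D lattice
    return tokens.T.reshape(-1, h, w)

def img2seq(fmap):
    # "Img2Seq": (d, h, w) feature map -> (N, d) token sequence
    d = fmap.shape[0]
    return fmap.reshape(d, -1).T

def depthwise_conv(x, kernels):
    # x: (c, h, w); kernels: (c, k, k); "same" padding, one kernel per channel
    c, h, w = x.shape
    k = kernels.shape[1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                out[ch, i, j] = np.sum(xp[ch, i:i + k, j:j + k] * kernels[ch])
    return out

def relu(x):  # stand-in for the activation f
    return np.maximum(x, 0.0)

def localvit_ffn(z, w1, wd, w2, h, w):
    # Y = f(f(Z * W1) * Wd) * W2, wrapped by Seq2Img / Img2Seq
    x = seq2img(z, h, w)                  # tokens -> feature map
    x = relu(np.einsum('od,dhw->ohw', w1, x))   # 1x1 expansion conv W1 (d -> gamma*d), then f
    x = relu(depthwise_conv(x, wd))             # k x k depth-wise conv Wd, then f
    x = np.einsum('od,dhw->ohw', w2, x)         # 1x1 projection conv W2 (gamma*d -> d)
    return img2seq(x)                     # feature map -> tokens

# toy sizes (illustrative): 4x4 lattice (N=16 tokens), d=8, gamma=2, k=3
rng = np.random.default_rng(0)
h = w = 4; d = 8; gamma = 2; k = 3
z = rng.standard_normal((h * w, d))
w1 = rng.standard_normal((gamma * d, d))
wd = rng.standard_normal((gamma * d, k, k))
w2 = rng.standard_normal((d, gamma * d))
y = localvit_ffn(z, w1, wd, w2, h, w)
print(y.shape)
```

The 1×1 convolutions are expressed as per-pixel matrix multiplications via `einsum`, which is equivalent to a pointwise convolution on the lattice; only the depth-wise step mixes spatially, which is the locality the method adds.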

Source: LocalViT: Bringing Locality to Vision Transformers
