Feedforward Networks

Mix-FFN is a feedforward layer used in the SegFormer architecture. ViT uses positional encoding (PE) to introduce location information, but the resolution of the PE is fixed. As a result, when the test resolution differs from the training one, the positional code must be interpolated, which often degrades accuracy. To alleviate this problem, CPVT uses a $3 \times 3$ Conv together with the PE to implement a data-driven PE. The authors of Mix-FFN argue that positional encoding is not actually necessary for semantic segmentation. Instead, they use Mix-FFN, which exploits the location information leaked by zero padding, applying a $3 \times 3$ Conv directly in the feed-forward network (FFN). Mix-FFN can be formulated as:

$$ \mathbf{x}_{\text{out}}=\operatorname{MLP}\left(\operatorname{GELU}\left(\operatorname{Conv}_{3 \times 3}\left(\operatorname{MLP}\left(\mathbf{x}_{\text{in}}\right)\right)\right)\right)+\mathbf{x}_{\text{in}} $$

where $\mathbf{x}_{\text{in}}$ is the feature from the self-attention module. Mix-FFN mixes a $3 \times 3$ convolution and an MLP into each FFN.
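Below is a minimal PyTorch sketch of such a block, following the formula above. It assumes the tokens arrive as a `(B, N, C)` sequence with known spatial size `H × W`; the class name `MixFFN`, the expansion factor, and the choice of a depthwise $3 \times 3$ convolution are illustrative assumptions, not the official SegFormer implementation.

```python
import torch
import torch.nn as nn


class MixFFN(nn.Module):
    """Sketch of a Mix-FFN block: MLP -> 3x3 Conv -> GELU -> MLP, plus a residual."""

    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)      # first MLP (channel expansion)
        # 3x3 conv with zero padding leaks positional information;
        # depthwise (groups=hidden) is an efficiency assumption
        self.conv = nn.Conv2d(hidden, hidden, kernel_size=3,
                              padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)      # second MLP (projection back)

    def forward(self, x, H, W):
        # x: (B, N, C) tokens from the self-attention module, with N = H * W
        residual = x
        x = self.fc1(x)
        B, N, C = x.shape
        # reshape tokens into a 2D feature map so the 3x3 conv can be applied
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.conv(x)
        x = x.flatten(2).transpose(1, 2)       # back to (B, N, C)
        x = self.act(x)
        x = self.fc2(x)
        return x + residual


# Example usage with assumed sizes
tokens = torch.randn(2, 16 * 16, 64)           # (B, N, C), N = H * W
out = MixFFN(dim=64)(tokens, H=16, W=16)
print(out.shape)                               # torch.Size([2, 256, 64])
```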

Source: SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
