Mix-FFN is a feed-forward layer used in the SegFormer architecture. ViT uses positional encoding (PE) to introduce location information, but the resolution of the PE is fixed: when the test resolution differs from the training one, the positional code must be interpolated, which often hurts accuracy. To alleviate this problem, CPVT uses a $3 \times 3$ convolution together with the PE to implement a data-driven PE. The SegFormer authors argue that positional encoding is not actually necessary for semantic segmentation. Instead, they use Mix-FFN, which exploits the location information leaked by zero padding by placing a $3 \times 3$ convolution directly inside the feed-forward network (FFN). Mix-FFN can be formulated as:
$$ \mathbf{x}_{\text{out}} = \operatorname{MLP}\left(\operatorname{GELU}\left(\operatorname{Conv}_{3 \times 3}\left(\operatorname{MLP}\left(\mathbf{x}_{\text{in}}\right)\right)\right)\right) + \mathbf{x}_{\text{in}} $$
where $\mathbf{x}_{\text{in}}$ is the feature from the self-attention module. Mix-FFN mixes a $3 \times 3$ convolution and an MLP into each FFN.
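The formula above can be sketched as a small PyTorch module. This is an illustrative sketch, not the official implementation: the class name, dimensions, and the choice of a depthwise $3 \times 3$ convolution (which the SegFormer codebase uses to keep the conv cheap) are assumptions made here for clarity.

```python
import torch
import torch.nn as nn

class MixFFN(nn.Module):
    """Sketch of Mix-FFN: MLP -> 3x3 Conv -> GELU -> MLP, plus a residual.

    Dimensions and the depthwise-conv choice are illustrative assumptions.
    """

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)   # first MLP (channel expansion)
        # 3x3 conv with zero padding leaks location information into the tokens;
        # groups=hidden_dim makes it depthwise, as in the SegFormer codebase
        self.conv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                              padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)   # second MLP (back to dim)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (B, N, C) token sequence from self-attention, with N = H * W
        residual = x
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)  # tokens -> 2D feature map
        x = self.conv(x)
        x = x.flatten(2).transpose(1, 2)           # feature map -> tokens
        x = self.act(x)
        x = self.fc2(x)
        return x + residual                        # residual connection

# Usage: a 16x16 token grid with 64 channels; output keeps the input shape
ffn = MixFFN(dim=64, hidden_dim=256)
tokens = torch.randn(2, 16 * 16, 64)
out = ffn(tokens, H=16, W=16)
```

Because the convolution operates on the tokens reshaped into their spatial layout, the module needs the grid size `(H, W)` at call time, which is why the forward pass takes them as arguments.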
Source: SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
| Task | Papers | Share |
|---|---|---|
| Semantic Segmentation | 19 | 36.54% |
| Image Segmentation | 3 | 5.77% |
| Instance Segmentation | 3 | 5.77% |
| Autonomous Driving | 2 | 3.85% |
| Super-Resolution | 2 | 3.85% |
| Change Detection | 2 | 3.85% |
| Image Classification | 2 | 3.85% |
| Object Detection | 2 | 3.85% |
| Lesion Segmentation | 1 | 1.92% |
| Component | Type |
|---|---|
| Convolution | Convolutions |
| Dense Connections | Feedforward Networks |
| GELU | Activation Functions |
| Residual Connection | Skip Connections |