Augmenting Convolutional networks with attention-based aggregation
We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling with an attention-based aggregation layer, akin to a single transformer block, which weights how much each patch contributes to the classification decision. We pair this learned aggregation layer with a simple patch-based convolutional network parametrized by two parameters (width and depth). In contrast with a pyramidal design, this architecture family maintains the input patch resolution across all layers. It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption, as shown by our experiments on various computer vision tasks: object classification, image segmentation, and detection.
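To make the aggregation idea concrete, here is a minimal NumPy sketch of single-query attention pooling over patch features. It is an illustration under stated assumptions, not the authors' implementation: the function name `attention_pool`, the learned query vector, the projection matrices `w_k` and `w_v`, and the sizes (196 patches, 64 channels) are all hypothetical, and the sketch leaves out everything else a transformer block would contain. It shows only the core step the abstract describes, namely weighting the patches before summing them.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(patches, query, w_k, w_v):
    """Aggregate (N, D) patch features into a single (D,) vector.

    `query` plays the role of a learned class query (an assumption of
    this sketch); `w_k` and `w_v` are hypothetical key/value projections.
    Returns the pooled vector and the per-patch attention weights.
    """
    d = patches.shape[-1]
    keys = patches @ w_k                 # (N, D)
    values = patches @ w_v               # (N, D)
    scores = keys @ query / np.sqrt(d)   # (N,) one score per patch
    weights = softmax(scores)            # (N,) non-negative, sums to 1
    pooled = weights @ values            # (D,) weighted sum of patches
    return pooled, weights


rng = np.random.default_rng(0)
num_patches, dim = 196, 64               # hypothetical sizes, e.g. a 14x14 grid
patches = rng.normal(size=(num_patches, dim))
query = rng.normal(size=dim)
w_k = rng.normal(size=(dim, dim)) / np.sqrt(dim)
w_v = np.eye(dim)                        # identity values keep the comparison simple

pooled, weights = attention_pool(patches, query, w_k, w_v)

# With a zero query every patch gets the same score, so the weights are
# uniform and the layer reduces to plain average pooling.
avg_pooled, uniform = attention_pool(patches, np.zeros(dim), w_k, w_v)
```

The last two lines illustrate why this layer can be read as a generalization of average pooling: uniform weights recover the mean, while a non-trivial query lets some patches count more than others. The `weights` vector is also what one would reshape to the patch grid to visualize which patches drive the decision.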
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Semantic Segmentation | ADE20K | PatchConvNet-S60 (UperNet) | Validation mIoU | 49.3 | # 127
Semantic Segmentation | ADE20K | PatchConvNet-L120 (UperNet) | Validation mIoU | 52.9 | # 78
Semantic Segmentation | ADE20K | PatchConvNet-B120 (UperNet) | Validation mIoU | 52.8 | # 80
Semantic Segmentation | ADE20K | PatchConvNet-B60 (UperNet) | Validation mIoU | 51.1 | # 93
Semantic Segmentation | ADE20K val | PatchConvNet-L120 (UperNet) | mIoU | 52.9 | # 38
Semantic Segmentation | ADE20K val | PatchConvNet-B120 (UperNet) | mIoU | 52.8 | # 39
Semantic Segmentation | ADE20K val | PatchConvNet-B60 (UperNet) | mIoU | 51.1 | # 43
Semantic Segmentation | ADE20K val | PatchConvNet-S60 (UperNet) | mIoU | 49.3 | # 55
Object Detection | COCO minival | PatchConvNet-S120 (Mask R-CNN) | box AP | 47.0 | # 93
Object Detection | COCO minival | PatchConvNet-S60 (Mask R-CNN) | box AP | 46.4 | # 97
Image Classification | ImageNet | PatchConvNet-B60-21k-384 | Top 1 Accuracy | 86.5% | # 135
Image Classification | ImageNet | PatchConvNet-B60-21k-384 | Number of params | 99.4M | # 866
Image Classification | ImageNet | PatchConvNet-B120 | Top 1 Accuracy | 84.1% | # 325
Image Classification | ImageNet | PatchConvNet-B120 | Number of params | 188.6M | # 888
Image Classification | ImageNet | PatchConvNet-B60 | Top 1 Accuracy | 83.5% | # 391
Image Classification | ImageNet | PatchConvNet-B60 | Number of params | 99.4M | # 866
Image Classification | ImageNet | PatchConvNet-S120 | Top 1 Accuracy | 83.2% | # 413
Image Classification | ImageNet | PatchConvNet-S120 | Number of params | 47.7M | # 712
Image Classification | ImageNet | PatchConvNet-S60 | Top 1 Accuracy | 82.1% | # 525
Image Classification | ImageNet | PatchConvNet-S60 | Number of params | 25.2M | # 593
Image Classification | ImageNet | PatchConvNet-S60-21k-512 | Top 1 Accuracy | 85.4% | # 221
Image Classification | ImageNet | PatchConvNet-S60-21k-512 | Number of params | 25.2M | # 593
Image Classification | ImageNet | PatchConvNet-L120-21k-384 | Top 1 Accuracy | 87.1% | # 103
Image Classification | ImageNet | PatchConvNet-L120-21k-384 | Number of params | 334.3M | # 920