Augmenting Convolutional networks with attention-based aggregation
We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling with an attention-based aggregation layer, akin to a single transformer block, that weights how the patches are involved in the classification decision. We combine this learned aggregation layer with a simplistic patch-based convolutional network parametrized by two parameters (width and depth). In contrast with a pyramidal design, this architecture family maintains the input patch resolution across all layers. It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption, as shown by our experiments on various computer vision tasks: object classification, image segmentation and detection.
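To make the aggregation idea concrete, here is a minimal pure-Python sketch of attention-based pooling over patch vectors, set against plain average pooling. It is an illustration under simplifying assumptions, not the paper's implementation: the names `attention_pool` and `average_pool` are hypothetical, there is a single head, the key and value projections are the identity, and the `query` vector stands in for a learned class token.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(patches, query):
    """Aggregate N patch vectors into one vector using a single query.

    patches: list of N vectors, each of dimension d
    query:   one vector of dimension d (assumed stand-in for a learned class token)
    Returns (pooled_vector, weights), where weights has one entry per patch.
    """
    d = len(query)
    # Scaled dot-product score between the query and each patch.
    scores = [
        sum(q * p for q, p in zip(query, patch)) / math.sqrt(d)
        for patch in patches
    ]
    weights = softmax(scores)
    # Weighted sum of the patch vectors.
    pooled = [
        sum(w * patch[j] for w, patch in zip(weights, patches))
        for j in range(d)
    ]
    return pooled, weights


def average_pool(patches):
    """Baseline: every patch contributes equally."""
    n = len(patches)
    d = len(patches[0])
    return [sum(p[j] for p in patches) / n for j in range(d)]


# Toy example: three 2-dimensional patch embeddings.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(patches, query=[2.0, 0.0])
uniform_pooled, uniform_weights = attention_pool(patches, query=[0.0, 0.0])
```

With a zero query every score is equal, so the weights are uniform and the result matches average pooling. A non-zero query shifts weight toward the patches it aligns with, and those per-patch weights are what indicate how much each patch contributes to the decision.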
Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Benchmark |
---|---|---|---|---|---|---|
Semantic Segmentation | ADE20K | PatchConvNet-S60 (UperNet) | Validation mIoU | 49.3 | # 134 | |
Semantic Segmentation | ADE20K | PatchConvNet-L120 (UperNet) | Validation mIoU | 52.9 | # 84 | |
Semantic Segmentation | ADE20K | PatchConvNet-B120 (UperNet) | Validation mIoU | 52.8 | # 86 | |
Semantic Segmentation | ADE20K | PatchConvNet-B60 (UperNet) | Validation mIoU | 51.1 | # 100 | |
Semantic Segmentation | ADE20K val | PatchConvNet-L120 (UperNet) | mIoU | 52.9 | # 41 | |
Semantic Segmentation | ADE20K val | PatchConvNet-B120 (UperNet) | mIoU | 52.8 | # 43 | |
Semantic Segmentation | ADE20K val | PatchConvNet-B60 (UperNet) | mIoU | 51.1 | # 47 | |
Semantic Segmentation | ADE20K val | PatchConvNet-S60 (UperNet) | mIoU | 49.3 | # 59 | |
Object Detection | COCO minival | PatchConvNet-S120 (Mask R-CNN) | box AP | 47.0 | # 99 | |
Object Detection | COCO minival | PatchConvNet-S60 (Mask R-CNN) | box AP | 46.4 | # 104 | |
Image Classification | ImageNet | PatchConvNet-B60-21k-384 | Top 1 Accuracy | 86.5% | # 136 | |
Image Classification | ImageNet | PatchConvNet-B60-21k-384 | Number of params | 99.4M | # 940 | |
Image Classification | ImageNet | PatchConvNet-B120 | Top 1 Accuracy | 84.1% | # 346 | |
Image Classification | ImageNet | PatchConvNet-B120 | Number of params | 188.6M | # 966 | |
Image Classification | ImageNet | PatchConvNet-B60 | Top 1 Accuracy | 83.5% | # 420 | |
Image Classification | ImageNet | PatchConvNet-B60 | Number of params | 99.4M | # 940 | |
Image Classification | ImageNet | PatchConvNet-S120 | Top 1 Accuracy | 83.2% | # 447 | |
Image Classification | ImageNet | PatchConvNet-S120 | Number of params | 47.7M | # 769 | |
Image Classification | ImageNet | PatchConvNet-S60 | Top 1 Accuracy | 82.1% | # 573 | |
Image Classification | ImageNet | PatchConvNet-S60 | Number of params | 25.2M | # 645 | |
Image Classification | ImageNet | PatchConvNet-S60-21k-512 | Top 1 Accuracy | 85.4% | # 227 | |
Image Classification | ImageNet | PatchConvNet-S60-21k-512 | Number of params | 25.2M | # 645 | |
Image Classification | ImageNet | PatchConvNet-L120-21k-384 | Top 1 Accuracy | 87.1% | # 101 | |
Image Classification | ImageNet | PatchConvNet-L120-21k-384 | Number of params | 334.3M | # 1004 | |