X-volution: On the unification of convolution and self-attention

4 Jun 2021  ·  Xuanhong Chen, Hang Wang, Bingbing Ni ·

Convolution and self-attention act as two fundamental building blocks in deep neural networks: the former extracts local image features linearly, while the latter non-locally encodes high-order contextual relationships. Though essentially complementary to each other (first-order vs. high-order), state-of-the-art architectures such as CNNs and Transformers lack a principled way to apply both operations simultaneously in a single computational module, owing to their heterogeneous computing patterns and the excessive cost of global dot-products in visual tasks. In this work, we theoretically derive a global self-attention approximation scheme, which approximates self-attention via a convolution operation on transformed features. Based on this approximation scheme, we establish a multi-branch elementary module composed of both convolution and self-attention operations, capable of unifying local and non-local feature interaction. Importantly, once trained, this multi-branch module can be conditionally converted into a single standard convolution via structural re-parameterization, yielding a pure-convolution operator named X-volution that is ready to be plugged into any modern network as an atomic operation. Extensive experiments demonstrate that the proposed X-volution achieves highly competitive visual understanding improvements (+1.2% top-1 accuracy on ImageNet classification, +1.7 box AP and +1.5 mask AP on COCO detection and segmentation).
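The key enabler above is that, after training, both branches are linear operators, so they can be folded into one kernel. A minimal sketch of this structural re-parameterization idea, assuming two parallel 3x3 branches whose kernels (`K_conv`, `K_attn`, standing in for the trained convolution branch and the convolution-approximated attention branch; these names are illustrative, not the paper's actual operators):

```python
# Sketch: merging two parallel linear (convolution) branches into a single
# convolution via kernel addition, the linearity that structural
# re-parameterization relies on. Pure-Python valid-mode 2D cross-correlation.

def conv2d(img, kernel):
    """Valid-mode 2D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            s = sum(img[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

def add_maps(a, b):
    """Elementwise sum of two equal-shape 2D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Illustrative kernels for the two parallel branches.
K_conv = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
K_attn = [[1, 0, 1], [0, 2, 0], [1, 0, 1]]

img = [[float(i * 5 + j) for j in range(5)] for i in range(5)]

# Multi-branch form: run each branch, then sum the feature maps.
multi_branch = add_maps(conv2d(img, K_conv), conv2d(img, K_attn))

# Re-parameterized form: fold the kernels first, run one convolution.
K_merged = add_maps(K_conv, K_attn)
single_conv = conv2d(img, K_merged)

assert multi_branch == single_conv  # identical outputs, one conv at inference
```

Because convolution is linear in its kernel, summing the branch outputs is exactly equivalent to convolving once with the summed kernel, so inference pays for only a single standard convolution.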

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Object Detection | COCO minival | Faster R-CNN (FPN, X-volution) | box AP | 42.8 | #135 |
| | | | AP50 | 64.0 | #44 |
| | | | AP75 | 46.4 | #51 |
| | | | APS | 26.9 | #32 |
| | | | APM | 46.0 | #43 |
| | | | APL | 55.0 | #58 |
| Instance Segmentation | COCO minival | Mask R-CNN (FPN, X-volution, SA) | mask AP | 37.2 | #81 |
| | | | APL | 53.1 | #9 |
| | | | APM | 40.0 | #11 |
| | | | APS | 19.2 | #10 |
| Image Classification | ImageNet | ResNet-50 (X-volution, stage3) | Top 1 Accuracy | 76.6% | #839 |
| Image Classification | ImageNet | ResNet-34 (X-volution, stage3) | Top 1 Accuracy | 75.0% | #888 |