Fast Non-Local Neural Networks with Spectral Residual Learning

Effectively modeling long-range spatial correlation is crucial in context-sensitive visual computing tasks such as human pose estimation and video classification. Enlarging the receptive field is a popular strategy for building such non-local deep networks. However, current solutions, including dilated convolutions and self-attention-based operators, mostly suffer from either low computational efficiency or an insufficient receptive field. This paper proposes spectral residual learning (SRL), a novel network architectural design for achieving a fully global receptive field. A neural block that implements SRL has three key components: a local-to-global transform that projects ordinary local features into a spectral domain, a set of operations carried out in that spectral domain, and a global-to-local transform that converts the data back to the original local format. We show that this is equivalent to conducting residual learning in a spectral domain and carefully re-formulate a variety of neural layers, such as ReLU and convolution, into their spectral forms. The benefits of SRL are three-fold. First, all operations have a global receptive field: any update affects all image positions, which allows richer context information to be extracted in various vision tasks. Second, the local-to-global and global-to-local transforms in SRL are defined by bi-linear unitary matrices, which are economical in both computation and parameters. Third, SRL is a generic formulation, instantiated here with the Fourier transform and real orthogonal matrices. We conduct comprehensive evaluations on two challenging tasks: human pose estimation from images and video classification. All experiments clearly show performance improvements by large margins in comparison with conventional non-local network designs.
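
The three-step block structure described in the abstract (local-to-global transform, spectral-domain operations, global-to-local transform with a residual connection) can be illustrated with a short sketch. The code below is an assumption-laden illustration, not the authors' exact formulation: it assumes PyTorch, uses a 2-D orthonormal FFT as the local-to-global transform, and stands in for the paper's spectral layers with a simple learnable complex channel-mixing step; the class name and hyperparameters are hypothetical.

```python
# Minimal sketch of a spectral residual block (illustrative only).
import torch
import torch.nn as nn


class SpectralResidualBlock(nn.Module):
    """Hypothetical block: FFT -> spectral operation -> inverse FFT -> residual add."""

    def __init__(self, channels: int):
        super().__init__()
        # Learnable complex 1x1 channel mixing applied in the spectral domain
        # (a stand-in for the paper's re-formulated spectral layers).
        self.weight_real = nn.Parameter(torch.eye(channels))
        self.weight_imag = nn.Parameter(torch.zeros(channels, channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width), real-valued local features.
        # Local-to-global: unitary (orthonormal) 2-D FFT over the spatial dims.
        spec = torch.fft.fft2(x, norm="ortho")

        # Spectral-domain operation: each output entry depends on every spatial
        # position of the input, so the effective receptive field is global.
        w = torch.complex(self.weight_real, self.weight_imag)
        spec = torch.einsum("oc,bchw->bohw", w, spec)

        # Global-to-local: inverse FFT back to the spatial domain.
        out = torch.fft.ifft2(spec, norm="ortho").real

        # Residual connection around the spectral branch.
        return x + out


# Usage sketch:
# block = SpectralResidualBlock(channels=64)
# y = block(torch.randn(2, 64, 32, 32))   # y has the same shape as the input
```

Because the forward and inverse transforms here are linear and unitary, adding the residual in the spatial domain is equivalent to adding it in the spectral domain, which is the sense in which the abstract speaks of residual learning in a spectral domain.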
