Tuformer: Data-Driven Design of Expressive Transformer by Tucker Tensor Representation

ICLR 2022 · Xiaoyu Liu, Jiahao Su, Furong Huang

Transformers are neural network architectures that achieve remarkable performance in many areas. However, the core component of Transformers, multi-head self-attention (MHSA), is mainly derived from heuristics, and the interactions across its components are not well understood. To address this problem, we first introduce a mathematically rigorous yet intuitive tensor diagram representation of MHSA. Guided by tensor diagrams, we formulate a design space in which the expressive power of network structures can be analyzed, opening new directions and possibilities for enhanced performance. We then propose a novel design, Tucker Transformers (Tuformers), inspired by a variant of the Tucker tensor representation and guaranteed to have higher expressive power than MHSA. Unlike vanilla Transformers, where the number of heads is a pre-defined fixed constant, Tuformer's structure is data-driven and its number of heads is trainable. Training Tuformers can be made efficient because they admit initialization from existing pre-trained Transformer models. We test Tuformers on various tasks across multiple domains and show competitive results under a wide range of model sizes.
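The abstract's central idea, replacing the fixed head partition of MHSA with a data-driven, trainable mixing of attention components, can be sketched roughly as below. This is a minimal illustration under assumptions, not the paper's actual Tucker-based parameterization: the class, parameter names (`TuckerStyleAttention`, `core`, `num_components`), and shapes are made up for exposition.

```python
# Minimal sketch (assumptions, not the paper's exact formulation): per-head
# attention outputs are mixed by a trainable core matrix rather than being
# fixed by a hard head split, so the effective "heads" are learned from data.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TuckerStyleAttention(nn.Module):
    def __init__(self, d_model: int, num_components: int):
        super().__init__()
        assert d_model % num_components == 0
        self.r = num_components            # plays the role of a head count
        self.d_model = d_model
        self.d_head = d_model // num_components
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)
        # Trainable core mixing the r attention components; an identity core
        # roughly recovers vanilla MHSA with r heads.
        self.core = nn.Parameter(torch.eye(num_components))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, _ = x.shape
        q = self.W_q(x).view(B, N, self.r, self.d_head).transpose(1, 2)  # (B, r, N, d_head)
        k = self.W_k(x).view(B, N, self.r, self.d_head).transpose(1, 2)
        v = self.W_v(x).view(B, N, self.r, self.d_head).transpose(1, 2)

        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)  # (B, r, N, N)
        heads = attn @ v                                                        # (B, r, N, d_head)

        # Data-driven mixing: each output component is a learned combination
        # of all r attention components.
        mixed = torch.einsum('hr,brnd->bhnd', self.core, heads)
        out = mixed.transpose(1, 2).reshape(B, N, self.d_model)
        return self.W_o(out)


# Usage example with arbitrary sizes.
layer = TuckerStyleAttention(d_model=64, num_components=8)
y = layer(torch.randn(2, 10, 64))   # -> shape (2, 10, 64)
```

Initializing `core` to the identity mimics the abstract's point that Tuformers can be initialized from a pre-trained Transformer and then learn the head structure during training.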

