Scale down Transformer by Grouping Features for a Lightweight Character-level Language Model
This paper introduces a method that efficiently reduces the computational cost and parameter size of the Transformer. The proposed model, referred to as Group-Transformer, splits the feature space into multiple groups, factorizes the calculation paths, and reduces the computation required for group interaction. Extensive experiments on two benchmark tasks, enwik8 and text8, prove our model's effectiveness and efficiency among small-scale Transformers. To the best of our knowledge, Group-Transformer is the first attempt to design a Transformer with the group strategy widely used in efficient CNN architectures.
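The abstract does not spell out the module design, but a minimal sketch of the general group strategy it references (splitting features into groups with small per-group transforms plus a cheap inter-group interaction, as in grouped convolutions) might look like the following. The class name `GroupedFeedForward`, the per-group linear maps, and the `mix` projection are illustrative assumptions, not the paper's actual architecture; PyTorch is assumed.

```python
import torch
import torch.nn as nn

class GroupedFeedForward(nn.Module):
    """Toy sketch of a group strategy (hypothetical, not the paper's module):
    split d_model features into n_groups chunks, transform each chunk with its
    own small linear map, then mix groups with a lightweight shared projection.
    Per-layer parameters drop roughly by a factor of n_groups vs. a dense map."""

    def __init__(self, d_model: int, n_groups: int):
        super().__init__()
        assert d_model % n_groups == 0
        self.n_groups = n_groups
        d_group = d_model // n_groups
        # one small (d_group x d_group) map per group instead of a single
        # dense (d_model x d_model) map
        self.group_maps = nn.ModuleList(
            nn.Linear(d_group, d_group) for _ in range(n_groups)
        )
        # cheap inter-group interaction: an (n_groups x n_groups) mixing matrix
        self.mix = nn.Linear(n_groups, n_groups, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        chunks = x.chunk(self.n_groups, dim=-1)  # split the feature space
        out = torch.stack(
            [m(c) for m, c in zip(self.group_maps, chunks)], dim=-1
        )                                        # (batch, seq, d_group, n_groups)
        out = self.mix(out)                      # reduced-cost group interaction
        return out.flatten(-2)                   # back to (batch, seq, d_model)

x = torch.randn(2, 16, 256)
print(GroupedFeedForward(d_model=256, n_groups=4)(x).shape)  # (2, 16, 256)
```

For d_model = 256 with 4 groups, the grouped maps use 4 x 64 x 64 = 16,384 weights versus 65,536 for a dense 256 x 256 layer, which illustrates the roughly n_groups-fold parameter saving the group strategy targets.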