Subformer: A Parameter Reduced Transformer

1 Jan 2021 · Machel Reid, Edison Marrese-Taylor, Yutaka Matsuo

The advent of the Transformer can arguably be described as a driving force behind many of the recent advances in natural language processing. However, despite its sizeable performance improvements, the model has recently been shown to be severely over-parameterized, making it parameter-inefficient and computationally expensive to train. Inspired by the success of parameter sharing in pre-trained deep contextualized word representation encoders, we explore parameter-sharing methods in Transformers, with a specific focus on encoder-decoder models for sequence-to-sequence tasks such as machine translation. We analyze different parameter sharing and reduction methods and develop the Subformer, a parameter-efficient Transformer-based model that combines the newly proposed Sandwich-style parameter sharing technique with self-attentive embedding factorization (SAFE). Experiments on machine translation, abstractive summarization, and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters. On the WMT'14 English-German test set, the Subformer matches, and sometimes outperforms (+0.1 BLEU), the Transformer-base model while using 40% fewer parameters. It also matches Transformer-big with 40% fewer parameters and outperforms it by 0.7 BLEU with 12M fewer parameters. On language modeling, it outperforms the standard Transformer-XL model, lowering perplexity by 3.6 with 37% fewer parameters.
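To make the parameter-sharing idea concrete, here is a minimal counting sketch. It assumes that Sandwich-style sharing means the first and last layers of a stack keep their own parameters while all middle layers reuse a single shared set; the abstract names the technique but does not define it, so treat that layout as an assumption. The function names, the Transformer-base-like sizes (6 layers, d_model 512, d_ff 2048), and the per-layer parameter formula are illustrative choices and are not taken from the Subformer implementation. The sketch counts one layer stack only and does not attempt to reproduce the model sizes reported above.

```python
def sandwich_layer_ids(num_layers):
    """Map each layer index to a parameter-group id.

    Assumed layout: first layer -> group 0, every middle layer -> group 1,
    last layer -> group 2. Stacks of one or two layers have no middle,
    so every layer keeps its own group.
    """
    if num_layers <= 2:
        return list(range(num_layers))
    return [0] + [1] * (num_layers - 2) + [2]


def encoder_layer_params(d_model, d_ff):
    """Weights and biases in one standard Transformer encoder layer."""
    # Q, K, V and output projections, each d_model x d_model plus a bias.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear maps: d_model -> d_ff and d_ff -> d_model, with biases.
    feed_forward = d_model * d_ff + d_ff + d_ff * d_model + d_model
    # Two layer norms, each with a scale and a bias vector.
    layer_norms = 2 * 2 * d_model
    return attention + feed_forward + layer_norms


def stack_params(num_layers, d_model, d_ff, share):
    """Unique parameters in a layer stack, with or without sharing."""
    per_layer = encoder_layer_params(d_model, d_ff)
    if share:
        groups = sandwich_layer_ids(num_layers)
    else:
        groups = list(range(num_layers))
    # Layers in the same group reuse one parameter set.
    return len(set(groups)) * per_layer


baseline = stack_params(6, 512, 2048, share=False)
shared = stack_params(6, 512, 2048, share=True)
print(baseline, shared, 1 - shared / baseline)
```

Under these assumptions a six-layer stack stores three unique parameter sets instead of six, which halves the layer parameters. Embedding parameters are left out of this count; the abstract addresses those separately through SAFE.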

Task                            Dataset                  Model             Metric Name                  Metric Value  Global Rank
Abstractive Text Summarization  CNN / Daily Mail         Subformer-base    ROUGE-1                      40.9          # 24
                                                                           ROUGE-2                      18.3          # 22
                                                                           ROUGE-L                      37.7          # 24
Language Modelling              WikiText-103             Subformer         Test perplexity              20.39         # 23
                                                                           Number of params             96M           # 32
Machine Translation             WMT2014 English-German   Subformer-xlarge  BLEU score                   29.3          # 20
                                                                           Hardware Burden              None          # 1
                                                                           Operations per network pass  None          # 1