Lessons on Parameter Sharing across Layers in Transformers

13 Apr 2021  ·  Sho Takase, Shun Kiyono ·

We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique that shares the parameters of one layer across all layers, as in Universal Transformers (Dehghani et al., 2019), to improve computational efficiency. We propose three strategies for assigning parameters to layers: Sequence, Cycle, and Cycle (rev). Experimental results show that the proposed strategies are efficient in both parameter size and computational time. Moreover, we show that the proposed strategies are also effective in settings with large amounts of training data, such as the recent WMT competitions.
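The three strategies can be read as different ways of mapping N Transformer layers onto M unique parameter sets. A minimal sketch of this mapping, under the assumed interpretation that Sequence repeats each set for consecutive layers, Cycle repeats the sets in order, and Cycle (rev) cycles but stacks the final repetition in reverse order (function and strategy names are illustrative, not from the paper's code):

```python
def assign_layers(strategy: str, m: int, n: int) -> list[int]:
    """Map n Transformer layers to m unique parameter sets (m must divide n).

    Assumed interpretation of the paper's three strategies, for m=3, n=6:
      - "sequence":  consecutive layers share parameters -> [0, 0, 1, 1, 2, 2]
      - "cycle":     repeat the m sets in order          -> [0, 1, 2, 0, 1, 2]
      - "cycle_rev": cycle, final repetition reversed    -> [0, 1, 2, 2, 1, 0]
    """
    assert n % m == 0, "layer count must be a multiple of the unique set count"
    reps = n // m
    if strategy == "sequence":
        return [i // reps for i in range(n)]
    if strategy == "cycle":
        return [i % m for i in range(n)]
    if strategy == "cycle_rev":
        # Cycle through the first (reps - 1) repetitions, then reverse the last one.
        return [i % m for i in range(n - m)] + list(reversed(range(m)))
    raise ValueError(f"unknown strategy: {strategy}")
```

With this mapping, layer i of the network simply reuses the parameter module at index `assign_layers(strategy, m, n)[i]`, so memory grows with m rather than n.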


Results from the Paper

 Ranked #1 on Machine Translation on WMT2014 English-German (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Machine Translation | WMT2014 English-German | Transformer Cycle (Rev) | BLEU score | 35.14 | # 1 |
| Machine Translation | WMT2014 English-German | Transformer Cycle (Rev) | SacreBLEU | 33.54 | # 2 |
| Machine Translation | WMT2014 English-German | Transformer Cycle (Rev) | Hardware Burden | None | # 1 |
| Machine Translation | WMT2014 English-German | Transformer Cycle (Rev) | Operations per network pass | None | # 1 |

