Very Deep Transformers for Neural Machine Translation

18 Aug 2020  ·  Xiaodong Liu, Kevin Duh, Liyuan Liu, Jianfeng Gao ·

We explore the application of very deep Transformer models to Neural Machine Translation (NMT). Using a simple yet effective initialization technique that stabilizes training, we show that it is feasible to build standard Transformer-based models with up to 60 encoder layers and 12 decoder layers. These deep models outperform their baseline 6-layer counterparts by as much as 2.5 BLEU, and achieve new state-of-the-art benchmark results on WMT14 English-French (43.8 BLEU, and 46.4 BLEU with back-translation) and WMT14 English-German (30.1 BLEU). The code and trained models will be publicly available.
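The initialization technique referenced above is ADMIN (ADaptive Model INitialization), which reweights each residual connection so that the variance of the residual stream stays balanced as depth grows. The following is a minimal, hedged sketch of the idea in pure Python, not the authors' implementation: a toy one-dimensional "sublayer" stands in for attention/FFN blocks, a profiling pass accumulates branch variances to set the residual weights ω, and training-time forward passes then use `x·ω + f(x)` in place of `x + f(x)`. All names (`sublayer`, `admin_forward`) are illustrative assumptions.

```python
import math
import random
import statistics

random.seed(0)

def sublayer(x, w):
    # Toy stand-in for an attention/FFN sublayer (illustrative only).
    return [math.tanh(w * v) for v in x]

depth = 8
weights = [random.uniform(0.5, 1.5) for _ in range(depth)]

# --- Profiling pass: run with plain residuals (omega = 1) and record
# the accumulated variance of the residual stream before each layer. ---
x = [random.gauss(0, 1) for _ in range(256)]
omegas = []
var_acc = statistics.pvariance(x)
for w in weights:
    branch = sublayer(x, w)
    # omega_i is set from the variance accumulated so far, so deeper
    # layers do not let the residual stream's variance explode.
    omegas.append(math.sqrt(var_acc))
    var_acc += statistics.pvariance(branch)
    x = [a + b for a, b in zip(x, branch)]

# --- Training-time forward: residuals rescaled by the frozen omegas ---
def admin_forward(x):
    for w, omega in zip(weights, omegas):
        branch = sublayer(x, w)
        x = [omega * a + b for a, b in zip(x, branch)]
    return x

out = admin_forward([random.gauss(0, 1) for _ in range(256)])
```

Because branch variances are non-negative, the accumulated variance (and hence ω) is non-decreasing with depth; after profiling, the ω values are fixed and training proceeds normally.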



Results from the Paper

Ranked #1 on Machine Translation on WMT2014 English-French (using extra training data)
| Task | Dataset | Model | Metric | Value | Global Rank | Uses Extra Training Data |
|------|---------|-------|--------|-------|-------------|--------------------------|
| Machine Translation | WMT2014 English-French | Transformer (60 layers) | BLEU | 41.8 | #2 | No |
| Machine Translation | WMT2014 English-French | Transformer+BT (ADMIN init) | BLEU score | 46.4 | #1 | Yes |
| Machine Translation | WMT2014 English-French | Transformer+BT (ADMIN init) | SacreBLEU | 44.4 | #1 | Yes |
| Machine Translation | WMT2014 English-French | Transformer (ADMIN init) | BLEU score | 43.8 | #4 | No |
| Machine Translation | WMT2014 English-French | Transformer (ADMIN init) | SacreBLEU | 41.8 | #3 | No |
| Machine Translation | WMT2014 English-German | Transformer (ADMIN init) | BLEU score | 30.1 | #8 | No |
| Machine Translation | WMT2014 English-German | Transformer (ADMIN init) | SacreBLEU | 29.5 | #5 | No |