Alleviating the Inequality of Attention Heads for Neural Machine Translation

Recent studies show that the attention heads in Transformer are not equal. We relate this phenomenon to the imbalanced training of multi-head attention and the model's dependence on specific heads. To tackle this problem, we propose a simple masking method, HeadMask, implemented in two specific ways. Experiments show that translation improvements are achieved on multiple language pairs. Subsequent empirical analyses also support our assumption and confirm the effectiveness of the method.
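For illustration, the sketch below shows what random head masking might look like inside a Transformer attention layer. This is a minimal PyTorch sketch under our own assumptions; the function name, argument names, and tensor layout are hypothetical and not taken from the paper's released implementation.

```python
# Hypothetical sketch of random head masking during training.
# Assumed tensor layout: (batch, num_heads, seq_len, head_dim).
import torch

def random_head_mask(attn_output: torch.Tensor, num_heads: int,
                     num_masked: int, training: bool = True) -> torch.Tensor:
    """Zero out a random subset of attention heads for this forward pass."""
    if not training or num_masked == 0:
        return attn_output
    # Pick which heads to silence for this batch.
    masked_idx = torch.randperm(num_heads, device=attn_output.device)[:num_masked]
    mask = torch.ones(num_heads, device=attn_output.device)
    mask[masked_idx] = 0.0
    # Broadcast the per-head mask over batch, sequence, and feature dimensions.
    return attn_output * mask.view(1, num_heads, 1, 1)
```

The sketch only covers random selection; the paper's "Impt" variant instead chooses which heads to mask according to a head-importance criterion, which is not reproduced here.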

Task                  Dataset                        Model                 Metric       Value   Global Rank
Machine Translation   IWSLT2015 Vietnamese-English   HeadMask (Random-18)  BLEU         26.85   #1
Machine Translation   IWSLT2015 Vietnamese-English   HeadMask (Impt-18)    BLEU         26.36   #2
Machine Translation   WMT2016 Romanian-English       HeadMask (Random-18)  BLEU score   32.85   #12
Machine Translation   WMT2016 Romanian-English       HeadMask (Impt-18)    BLEU score   32.95   #9
Machine Translation   WMT2017 Turkish-English        HeadMask (Random-18)  BLEU score   17.56   #1
Machine Translation   WMT2017 Turkish-English        HeadMask (Impt-18)    BLEU score   17.48   #2
