Impact of Corpora Quality on Neural Machine Translation

19 Oct 2018 · Matīss Rikters

Large parallel corpora that are automatically obtained from the web, documents or elsewhere often exhibit many corrupted parts that are bound to negatively affect the quality of the systems and models that learn from these corpora. This paper describes frequent problems found in data and how such data affects neural machine translation systems, as well as how to identify and deal with them...

Evaluation results from the paper


Task                 Dataset                   Model                                        Metric name  Metric value  Global rank
Machine Translation  WMT 2017 English-Latvian  Transformer trained on highly filtered data  BLEU         22.89         #1
Machine Translation  WMT 2017 Latvian-English  Transformer trained on highly filtered data  BLEU         24.37         #1
Machine Translation  WMT 2018 English-Finnish  Transformer trained on highly filtered data  BLEU         17.40         #1
Machine Translation  WMT 2018 Finnish-English  Transformer trained on highly filtered data  BLEU         24.00         #1