Impact of Corpora Quality on Neural Machine Translation

19 Oct 2018 · Matīss Rikters

Large parallel corpora that are automatically obtained from the web, documents or elsewhere often contain many corrupted parts that are bound to negatively affect the quality of the systems and models that learn from these corpora. This paper describes frequent problems found in such data and how they affect neural machine translation systems, as well as how to identify and deal with them. The solutions are summarised in a set of scripts that remove problematic sentences from input corpora.
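To make the idea of removing problematic sentences concrete, here is a minimal sketch of a parallel-corpus filter. It is not the paper's scripts: the function name `filter_parallel`, the specific checks (empty sides, untranslated copies, over-long sentences, length-ratio outliers, exact duplicates) and the thresholds `max_ratio` and `max_tokens` are illustrative assumptions.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def filter_parallel(
    pairs: Iterable[Pair],
    max_ratio: float = 3.0,   # assumed threshold, not taken from the paper
    max_tokens: int = 100,    # assumed threshold, not taken from the paper
) -> Iterator[Pair]:
    """Yield only sentence pairs that pass a few simple sanity checks."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Drop pairs where the target is just a copy of the source.
        if src == tgt:
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop overly long sentences (whitespace token count).
        if n_src > max_tokens or n_tgt > max_tokens:
            continue
        # Drop pairs whose lengths differ too much to be translations.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Drop exact duplicate pairs.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        yield src, tgt
```

Applied to a list of `(source, target)` tuples, `list(filter_parallel(pairs))` returns the surviving pairs in their original order. A real pipeline would likely add language identification and character-level checks on top of these.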


Task | Dataset | Model | Metric | Value | Global Rank
Machine Translation | WMT 2017 English-Latvian | Transformer trained on highly filtered data | BLEU | 22.89 | #1
Machine Translation | WMT 2017 Latvian-English | Transformer trained on highly filtered data | BLEU | 24.37 | #1
Machine Translation | WMT 2018 English-Finnish | Transformer trained on highly filtered data | BLEU | 17.40 | #1
Machine Translation | WMT 2018 Finnish-English | Transformer trained on highly filtered data | BLEU | 24.00 | #2
