The RWTH Aachen University Filtering System for the WMT 2018 Parallel Corpus Filtering Task

This paper describes the submission of RWTH Aachen University for the De→En parallel corpus filtering task of the EMNLP 2018 Third Conference on Machine Translation (WMT 2018). We use several rule-based, heuristic methods to preselect sentence pairs. These sentence pairs are then scored with count-based and neural systems that serve as language and translation models. In addition to scoring single sentence pairs, we implement a simple redundancy-removal heuristic. Our best-performing corpus filtering system relies on recurrent neural language models and translation models based on the transformer architecture. A model trained on 10M randomly sampled tokens reaches a performance of 9.2% BLEU on newstest2018. Using our filtering and ranking techniques, we achieve 34.8% BLEU.

