Tilde's Parallel Corpus Filtering Methods for WMT 2018

WS 2018 · Mārcis Pinnis

The paper describes parallel corpus filtering methods that reduce the noise in noisy "parallel" corpora from a level where the corpora are unusable for neural machine translation training (i.e., the resulting systems fail to achieve reasonable translation quality, scoring well below 10 BLEU points) to a level where the trained systems show decent translation quality (over 20 BLEU points on a 10 million word dataset and up to 30 BLEU points on a 100 million word dataset). The paper also documents Tilde's submissions to the WMT 2018 shared task on parallel corpus filtering.
