The University of Helsinki Submission to the WMT19 Parallel Corpus Filtering Task

WS 2019 · Raúl Vázquez, Umut Sulubacak, Jörg Tiedemann

This paper describes the University of Helsinki Language Technology group's participation in the WMT 2019 parallel corpus filtering task. Our scores were produced using a two-step strategy. First, we individually applied a series of filters to remove the 'bad' quality sentences. Then, we produced scores for each sentence by weighting these features with a classification model. This methodology allowed us to build a simple and reliable system that is easily adaptable to other language pairs.
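
The two-step strategy described above can be illustrated with a minimal sketch. It is not the submitted system: the three filters (`length_ratio`, `alpha_ratio`, `not_copy`), the `MIN_FEATURE` threshold, and the `WEIGHTS` and `BIAS` values are all hypothetical placeholders chosen for illustration, and a fixed logistic function stands in for whatever classification model was actually trained.

```python
import math

# Hypothetical filters: each maps a (source, target) pair to a feature in [0, 1].
def length_ratio(src, tgt):
    """Ratio of the shorter to the longer sentence, counted in tokens."""
    a, b = len(src.split()), len(tgt.split())
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)

def alpha_ratio(src, tgt):
    """Share of characters that are letters or whitespace."""
    text = src + tgt
    if not text:
        return 0.0
    return sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)

def not_copy(src, tgt):
    """0.0 if the target merely repeats the source, else 1.0."""
    return 0.0 if src.strip().lower() == tgt.strip().lower() else 1.0

FILTERS = [length_ratio, alpha_ratio, not_copy]

# Assumed values for illustration only; a real system would learn the
# weights and bias from labeled data instead of fixing them by hand.
WEIGHTS = [2.0, 2.0, 3.0]
BIAS = -4.0
MIN_FEATURE = 0.2  # assumed cut-off for the hard filtering step

def score_pair(src, tgt):
    """Return a quality score in [0, 1] for one sentence pair."""
    features = [f(src, tgt) for f in FILTERS]
    # Step 1: discard pairs that clearly fail any individual filter.
    if min(features) < MIN_FEATURE:
        return 0.0
    # Step 2: combine the filter outputs with a logistic model.
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))

good = score_pair("the cat sat on the mat", "kissa istui matolla nyt")
copied = score_pair("hello world", "hello world")
```

In this sketch, a pair whose target simply copies the source is rejected in the first step, while a plausible pair receives a score from the weighted combination. Adapting it to another language pair would mean swapping in filters suited to that pair and refitting the weights.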
