Prompsit's submission to WMT 2018 Parallel Corpus Filtering shared task

WS 2018 · Víctor M. Sánchez-Cartagena, Marta Bañón, Sergio Ortiz-Rojas, Gema Ramírez

This paper describes Prompsit Language Engineering's submissions to the WMT 2018 parallel corpus filtering shared task. Our four submissions were based on an automatic classifier for identifying pairs of sentences that are mutual translations. A set of hand-crafted hard rules for discarding sentences with evident flaws was applied before the classifier. We explored different strategies for achieving a training corpus with diverse vocabulary and fluent sentences: language model scoring, an active-learning-inspired data selection algorithm, and n-gram saturation. Our submissions were very competitive with those of other participants on the 100 million word training corpus.
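
The two-stage design described above (hard rules first, then a classifier score) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the specific rules (empty side, identical source and target, extreme length ratio), the names `passes_hard_rules`, `filter_pairs` and `toy_score`, and the thresholds are hypothetical and are not taken from the paper. The scoring function in particular is only a stand-in for a trained classifier.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]


def passes_hard_rules(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Hypothetical hard rules: reject pairs with evident flaws."""
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:
        return False  # one side is empty
    if src == tgt:
        return False  # source copied into target, not a translation
    s_len, t_len = len(src.split()), len(tgt.split())
    if max(s_len, t_len) / min(s_len, t_len) > max_ratio:
        return False  # token counts differ too much
    return True


def filter_pairs(
    pairs: Iterable[Pair],
    score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Pair]:
    """Apply the hard rules, then keep pairs the scorer rates at or above threshold."""
    kept = []
    for src, tgt in pairs:
        if not passes_hard_rules(src, tgt):
            continue  # discarded before the scorer is ever called
        if score(src, tgt) >= threshold:
            kept.append((src, tgt))
    return kept


def toy_score(src: str, tgt: str) -> float:
    """Placeholder for a trained classifier: ratio of token counts, in (0, 1]."""
    s_len, t_len = len(src.split()), len(tgt.split())
    return min(s_len, t_len) / max(s_len, t_len)


if __name__ == "__main__":
    sample = [
        ("Hello world", "Hallo Welt"),
        ("", "Leer"),
        ("Same text", "Same text"),
        ("One", "eins zwei drei vier fünf"),
    ]
    print(filter_pairs(sample, toy_score))
```

Running the rules before the scorer means the more expensive classification step only sees pairs that are not already obviously flawed; in a real system `toy_score` would be replaced by the probability output of a trained model.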
