Detecting Various Types of Noise for Neural Machine Translation

Findings (ACL) 2022 · Christian Herold, Jan Rosendahl, Joris Vanvinckenroye, Hermann Ney ·

The filtering and/or selection of training data is one of the core aspects to be considered when building a strong machine translation system.In their influential work, Khayrallah and Koehn (2018) investigated the impact of different types of noise on the performance of machine translation systems.In the same year the WMT introduced a shared task on parallel corpus filtering, which went on to be repeated in the following years, and resulted in many different filtering approaches being proposed.In this work we aim to combine the recent achievements in data filtering with the original analysis of Khayrallah and Koehn (2018) and investigate whether state-of-the-art filtering systems are capable of removing all the suggested noise types.We observe that most of these types of noise can be detected with an accuracy of over 90% by modern filtering systems when operating in a well studied high resource setting.However, we also find that when confronted with more refined noise categories or when working with a less common language pair, the performance of the filtering systems is far from optimal, showing that there is still room for improvement in this area of research.

PDF Abstract