Noisy Parallel Corpus Filtering through Projected Word Embeddings

WS 2019 · Murathan Kurfalı, Robert Östling

We present a very simple method for cleaning parallel text in low-resource languages, based on projecting word embeddings trained on large monolingual corpora in high-resource languages. Despite its simplicity, the method comes close to the strong baseline system in the downstream machine translation evaluation.
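
The abstract does not spell out the scoring procedure, so the following is only a minimal sketch of one way projected embeddings could be used to score sentence pairs, not the authors' implementation. It assumes that word alignment counts between the two languages are available, that each low-resource word receives the count-weighted mean of the vectors of its aligned high-resource words, and that a sentence pair is scored by the cosine similarity of averaged word vectors. All function names, the toy tokens, and the toy vectors are hypothetical.

```python
import math


def project_embeddings(src_vectors, alignments):
    """Assumed projection: each low-resource word gets the count-weighted
    mean of the vectors of the high-resource words it is aligned to."""
    projected = {}
    for tgt_word, counts in alignments.items():
        acc, total = None, 0
        for src_word, count in counts.items():
            vec = src_vectors.get(src_word)
            if vec is None:
                continue  # aligned word has no embedding; skip it
            if acc is None:
                acc = [0.0] * len(vec)
            for i, value in enumerate(vec):
                acc[i] += count * value
            total += count
        if acc is not None and total > 0:
            projected[tgt_word] = [a / total for a in acc]
    return projected


def sentence_vector(tokens, vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_pair(src_tokens, tgt_tokens, src_vectors, tgt_vectors):
    """Similarity of a sentence pair; 0.0 when either side has no known words."""
    a = sentence_vector(src_tokens, src_vectors)
    b = sentence_vector(tgt_tokens, tgt_vectors)
    if a is None or b is None:
        return 0.0
    return cosine(a, b)


# Toy data (hypothetical): 2-dimensional high-resource embeddings and
# alignment counts for two made-up low-resource tokens.
src_vectors = {"dog": [1.0, 0.0], "cat": [0.9, 0.1], "house": [0.0, 1.0]}
alignments = {"kukur": {"dog": 3}, "ghar": {"house": 2}}
tgt_vectors = project_embeddings(src_vectors, alignments)

good = score_pair(["dog"], ["kukur"], src_vectors, tgt_vectors)
bad = score_pair(["dog"], ["ghar"], src_vectors, tgt_vectors)
unknown = score_pair(["dog"], ["xyz"], src_vectors, tgt_vectors)
```

In a filtering setup along these lines, pairs would be ranked by score and the lowest-scoring ones discarded; the cutoff is a tuning choice that the abstract does not specify.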
