Coping with Noisy Training Data Labels in Paraphrase Detection

WNUT (ACL) 2021 · Teemu Vahtola, Mathias Creutz, Eetu Sjöblom, Sami Itkonen

We present new state-of-the-art results for paraphrase detection on all six languages in the Opusparcus sentential paraphrase corpus: English, Finnish, French, German, Russian, and Swedish. We reach these results by fine-tuning BERT. The best results are achieved with smaller and cleaner subsets of the training sets than previous research found optimal. Additionally, we study a translation-based approach that is competitive for the languages with more limited and noisier training data.
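
As a rough illustration of training on a "smaller and cleaner subset", the sketch below keeps only the k highest-scoring sentence pairs from a noisy training set before fine-tuning. It is a minimal sketch under stated assumptions, not the authors' pipeline: the per-pair quality score, the function name `select_cleanest`, and the toy data are all hypothetical.

```python
from typing import List, Tuple

# (sentence_a, sentence_b, score). The score field is an assumption: a
# higher value means the pair is more likely to be a true paraphrase.
ScoredPair = Tuple[str, str, float]


def select_cleanest(pairs: List[ScoredPair], k: int) -> List[ScoredPair]:
    """Return the k highest-scoring pairs, best first (hypothetical helper)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # sorted() is stable, so pairs with equal scores keep their input order.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)[:k]


# Toy data invented for illustration only; not taken from Opusparcus.
toy_pairs: List[ScoredPair] = [
    ("Thanks a lot.", "Thank you very much.", 0.9),
    ("I have to go.", "Where is the car?", 0.1),
    ("What is going on?", "What is happening?", 0.8),
    ("Sit down.", "Take a seat.", 0.7),
]

subset = select_cleanest(toy_pairs, k=2)
for sentence_a, sentence_b, score in subset:
    print(f"{score:.1f}\t{sentence_a}\t{sentence_b}")
```

In a setup like this, k would be treated as a hyperparameter: one would fine-tune on several subset sizes and compare results on a development set.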
