Reducing the Search Space for Parallel Sentences in Comparable Corpora

LREC 2020  ·  R{\'e}mi Cardon, Natalia Grabar ·

This paper describes and evaluates simple techniques for reducing the research space for parallel sentences in monolingual comparable corpora. Initially, when searching for parallel sentences between two comparable documents, all the possible sentence pairs between the documents have to be considered, which introduces a great degree of imbalance between parallel pairs and non-parallel pairs. This is a problem because even with a high performing algorithm, a lot of noise will be present in the extracted results, thus introducing a need for an extensive and costly manual check phase. We work on a manually annotated subset obtained from a French comparable corpus and show how we can drastically reduce the number of sentence pairs that have to be fed to a classifier so that the results can be manually handled.

PDF Abstract
No code implementations yet. Submit your code now

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here