Collaboratively Annotating Multilingual Parallel Corpora in the Biomedical Domain---some MANTRAs

LREC 2014 · Johannes Hellrich, Simon Clematide, Udo Hahn, Dietrich Rebholz-Schuhmann

The coverage of multilingual biomedical resources is high for the English language, yet sparse for non-English languages. This observation holds for seemingly well-resourced, yet still dramatically low-resourced languages such as Spanish, French or German, and even more so for truly under-resourced ones such as Dutch. We present experimental results for automatically annotating parallel corpora and simultaneously acquiring new biomedical terminology for these under-resourced non-English languages on the basis of two types of language resources, namely parallel corpora (i.e. full translation equivalents at the document unit level) and (admittedly deficient) multilingual biomedical terminologies, with English as their anchor language. We automatically annotate these parallel corpora with biomedical named entities using an ensemble of named entity taggers and harmonize non-identical annotations, the outcome of which is a so-called silver standard corpus. We conclude with an empirical assessment of this approach for automatically identifying both known and new terms in multilingual corpora.
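
The harmonization step mentioned above can be illustrated with a minimal sketch. The abstract does not say how non-identical annotations are reconciled, so the rule used here (exact-match voting with a minimum number of agreeing taggers) is an assumption, as are the function name `harmonize`, the `(start, end, label)` annotation format, and the example labels.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Hypothetical annotation format: (start offset, end offset, entity label).
Annotation = Tuple[int, int, str]


def harmonize(tagger_outputs: Iterable[Set[Annotation]], min_votes: int) -> List[Annotation]:
    """Keep annotations proposed by at least `min_votes` taggers.

    Assumed rule: two annotations agree only if span and label match exactly.
    """
    votes: Counter = Counter()
    for output in tagger_outputs:
        # Each output is a set, so a tagger votes at most once per annotation.
        votes.update(output)
    return sorted(ann for ann, count in votes.items() if count >= min_votes)


# Illustrative outputs of three taggers on one sentence (made-up offsets and labels).
tagger_a = {(0, 7, "DISO"), (12, 19, "CHEM")}
tagger_b = {(0, 7, "DISO"), (12, 19, "CHEM"), (25, 30, "ANAT")}
tagger_c = {(0, 7, "DISO"), (25, 31, "ANAT")}

silver = harmonize([tagger_a, tagger_b, tagger_c], min_votes=2)
print(silver)  # [(0, 7, 'DISO'), (12, 19, 'CHEM')]
```

In this sketch the two `ANAT` annotations are dropped because their spans differ by one character, so neither reaches two votes. A more lenient scheme could count overlapping spans as agreement, at the cost of having to choose which boundaries to keep.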
