HunOr: A Hungarian-Russian Parallel Corpus

LREC 2012 · Martina Katalin Szabó, Veronika Vincze, István Nagy T.

In this paper, we present HunOr, the first multi-domain Hungarian-Russian parallel corpus. Some of the corpus texts have been manually split into sentences and aligned, and named entities have also been annotated, while the other parts are automatically aligned at the sentence level and POS-tagged. The corpus contains texts from the domains of literature, official language use, and science; we would also like to add texts from the news domain. In the future, we plan to carry out a syntactic annotation of the HunOr corpus, which will further enhance its usability in various NLP fields such as transfer-based machine translation or cross-lingual information retrieval.
