MultiTACRED is a multilingual version of the large-scale TAC Relation Extraction Dataset. It covers 12 typologically diverse languages from 9 language families, and was created by the Speech & Language Technology group of DFKI by machine-translating the instances of the original TACRED dataset and automatically projecting their entity annotations. For details of the original TACRED's data collection and annotation process, see the Stanford paper. Translations are syntactically validated by checking the correctness of the XML tag markup. Any translations with an invalid tag structure, e.g. missing or invalid head or tail tag pairs, are discarded (on average, 2.3% of the instances).

Languages covered are: Arabic, Chinese, Finnish, French, German, Hindi, Hungarian, Japanese, Polish, Russian, Spanish, Turkish. Intended use is supervised relation classification. Audience - researchers.

Please see our ACL paper for full details.

Papers


Paper Code Results Date Stars

Dataset Loaders


Tasks