Dataset Description

EUROPA is a dataset designed for training and evaluating multilingual keyphrase generation models in the legal domain. It consists of legal judgments from the Court of Justice of the European Union (CJEU) and includes instances in all 24 official EU languages.

Key Features:

  • Multilingual: covers all 24 official EU languages.
  • Domain-specific: focuses on legal documents.
  • Source: derived from Court of Justice of the European Union judgments.

  • Curated by: N3 team
  • Languages: French, German, English, Italian, Dutch, Greek, Danish, Portuguese, Spanish, Swedish, Finnish, Lithuanian, Estonian, Czech, Hungarian, Latvian, Slovenian, Polish, Maltese, Slovak, Romanian, Bulgarian, Croatian, Irish
  • License: MIT License

Dataset Sources

  • Paper: https://arxiv.org/abs/2403.00252

Dataset Structure

  • celex_id: CELEX identifier inherited from the CJEU. Translated versions of the same judgment share the same celex_id, so to obtain a unique identifier for each instance you can concatenate the lang and celex_id values;
  • lang: ISO 639-1 language code;
  • input: judgment transcription or translation;
  • keyphrases: reference keyphrases drafted by the CJEU.
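
The per-instance identifier suggested above can be sketched as follows. This is a minimal illustration, not part of the dataset tooling: the function name `make_instance_id`, the underscore separator, and the sample CELEX values are all assumptions chosen for the example.

```python
from typing import Dict, List


def make_instance_id(record: Dict[str, str], sep: str = "_") -> str:
    """Concatenate lang and celex_id into one identifier per instance.

    The separator is an arbitrary choice for this sketch; any string that
    cannot appear in a language code works equally well.
    """
    return f"{record['lang']}{sep}{record['celex_id']}"


# Hypothetical records: the celex_id values are made up for illustration.
# The first two rows are translations of the same judgment.
records: List[Dict[str, str]] = [
    {"celex_id": "EXAMPLE-0001", "lang": "en"},
    {"celex_id": "EXAMPLE-0001", "lang": "fr"},
    {"celex_id": "EXAMPLE-0002", "lang": "en"},
]

instance_ids = [make_instance_id(r) for r in records]

# celex_id alone repeats across translations; the combined key does not.
celex_is_unique = len({r["celex_id"] for r in records}) == len(records)
combined_is_unique = len(set(instance_ids)) == len(records)
```

Putting the language code first keeps all instances of one language adjacent when the identifiers are sorted; reversing the order would instead group the translations of each judgment together.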

As explained in our paper, the dataset is split chronologically to assess the temporal generalization of models:

  • training set: 1957 to 2010 (131 076 instances);
  • validation set: 2011 to 2015 (63 373 instances);
  • test set: 2016 to 2023 (90 508 instances).
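
The chronological boundaries above can be written down as a small lookup, which may help when checking that a judgment falls in the expected split. This is only a sketch: the function `assign_split` is hypothetical, and it assumes the judgment year is available to the caller, since the fields listed in this card do not include a year.

```python
# Split boundaries and instance counts as stated in this card.
SPLIT_YEARS = {
    "train": (1957, 2010),
    "validation": (2011, 2015),
    "test": (2016, 2023),
}
SPLIT_SIZES = {"train": 131_076, "validation": 63_373, "test": 90_508}


def assign_split(year: int) -> str:
    """Return the split whose (inclusive) year range contains `year`."""
    for split, (first, last) in SPLIT_YEARS.items():
        if first <= year <= last:
            return split
    raise ValueError(f"year {year} is outside the dataset range 1957-2023")


total_instances = sum(SPLIT_SIZES.values())

# Share of all instances that falls in each split.
split_shares = {name: size / total_instances for name, size in SPLIT_SIZES.items()}
```

Because the ranges do not overlap, every test judgment postdates every training and validation judgment, which is what makes the test split a measure of temporal generalization.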

Citation

@article{salaun2024europa,
  title={EUROPA: A Legal Multilingual Keyphrase Generation Dataset},
  author={Sala{\"u}n, Olivier and Piedboeuf, Fr{\'e}d{\'e}ric and Le Berre, Guillaume and Hermelo, David Alfonso and Langlais, Philippe},
  journal={arXiv preprint arXiv:2403.00252},
  year={2024}
}

