CodRED is the first human-annotated cross-document relation extraction (RE) dataset, designed to test RE systems' ability to acquire knowledge in the wild. CodRED has the following features:

  • it requires natural language understanding at different granularities, from coarse-grained document retrieval to fine-grained cross-document multi-hop reasoning;

  • it contains 30,504 relational facts associated with 210,812 reasoning text paths, and covers a broad, balanced range of relations as well as long documents on diverse topics;

  • it provides strong supervision on the reasoning text paths used to predict each relation, which helps guide RE systems toward meaningful and interpretable reasoning;

  • it contains adversarially created hard NA instances, which prevent RE models from predicting relations from entity names alone rather than from the text.
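
To make the terms above concrete, here is a minimal sketch of how a relational fact and its reasoning text paths might be represented in memory. It is not the dataset's actual schema: the class names, field names, the `NA` label string, and the assumption that a text path pairs one document mentioning the head entity with one mentioning the tail entity are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set

NA = "n/a"  # hypothetical label meaning "this path expresses no relation"


@dataclass
class TextPath:
    """One candidate reasoning path across documents (illustrative only)."""
    head_doc: str        # text of a document mentioning the head entity
    tail_doc: str        # text of a document mentioning the tail entity
    relation: str = NA   # path-level label: the relation expressed, or NA


@dataclass
class EntityPair:
    """A head/tail entity pair together with its candidate text paths."""
    head: str
    tail: str
    paths: List[TextPath] = field(default_factory=list)

    def relations(self) -> Set[str]:
        # The pair-level relations are those supported by at least one path.
        return {p.relation for p in self.paths if p.relation != NA}


# Placeholder data, not taken from the dataset.
pair = EntityPair(
    head="Entity A",
    tail="Entity B",
    paths=[
        TextPath("a document about A", "a document about B", "some_relation"),
        TextPath("another document about A", "another document about B"),
    ],
)
print(pair.relations())
```

Under this sketch, a hard NA instance would be an entity pair whose paths are all labeled `NA`, even though the entity names alone might suggest a relation.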


