This dataset is part of the KELM corpus.
This is the Wikipedia text--Wikidata KG aligned corpus used to train the data-to-text generation model. Please note that this corpus was generated with distant supervision and should not be used as a gold standard for evaluation.
It consists of 3 files:
https://storage.googleapis.com/gresearch/kelm-corpus/updated-2021/quadruples-train.tsv
https://storage.googleapis.com/gresearch/kelm-corpus/updated-2021/quadruples-validation.tsv
https://storage.googleapis.com/gresearch/kelm-corpus/updated-2021/quadruples-test.tsv
Each file contains one example per line. Each example is a JSON object with three fields:
triples: A list of triples of the form (subject, relation, object), e.g. (Person X, award received, Award Y). If a triple has a subproperty, it is a quadruple instead, e.g. (Person X, Award Y, received on, Date Z).
serialized triples: The triples concatenated together, as used for input to T5. The format is "<subject> <relation> <object>". Where a subject has multiple relations, the subject appears once, followed by each relation and its object, e.g. "<subject> <relation1> <object1> <relation2> <object2> <relation3> <object3>". For more details on how these relations are grouped, please refer to the paper.
sentence: The Wikipedia sentence aligned to these triples.
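The one-example-per-line layout and the serialized format described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code. The JSON key names "triples" and "sentence", the helper names iter_examples and serialize_triples, and the sample record are all assumptions made for the example. The grouping shown covers only plain triples that share a subject; how quadruples are grouped is described in the paper and is not reproduced here.

```python
import json
from typing import Iterable, Iterator


def iter_examples(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed example per non-empty line (one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def serialize_triples(triples: list[tuple[str, str, str]]) -> str:
    """Write each subject once, followed by its relation/object pairs.

    Hypothetical helper: it handles 3-element triples only.
    """
    grouped: dict[str, list[str]] = {}
    for subject, relation, obj in triples:
        grouped.setdefault(subject, []).extend([relation, obj])
    return " ".join(" ".join([subject, *parts]) for subject, parts in grouped.items())


# Made-up sample record; the key names are assumed, not confirmed by the corpus.
sample_line = json.dumps({
    "triples": [
        ["Person X", "award received", "Award Y"],
        ["Person X", "occupation", "Occupation W"],
    ],
    "sentence": "Person X, an Occupation W, received Award Y.",
})

example = next(iter_examples([sample_line]))
print(serialize_triples([tuple(t) for t in example["triples"]]))
```

To read a real file, pass an open file handle (which iterates over lines) to iter_examples instead of the in-memory list, and inspect the keys of the first example before relying on the names assumed here.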
The names, aliases, and Wikidata IDs of the entities can be found in https://storage.googleapis.com/gresearch/kelm-corpus/updated-2021/entities.jsonl.