It is an extensive, high-quality cross-lingual fact-to-text dataset covering 11 languages: Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Oriya (or), Punjabi (pa), Tamil (ta), and Telugu (te), plus a monolingual dataset in English (en). It is the Wikipedia text <--> Wikidata KG aligned corpus used to train the data-to-text generation model. The train and validation splits are created using distant supervision methods, and the test data is generated through human annotation.
The dataset is publicly available here. Each directory holds the dataset for one language (referred to by its language ISO code) and contains three files.
The data in these files is stored in JSON Lines (jsonl) format.
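Since each file is in JSON Lines format, every non-empty line is a standalone JSON object that can be parsed on its own. The following is a minimal sketch of a reader, not the dataset authors' loader; the helper name `read_jsonl` is hypothetical, and the file path passed to it is an assumption, since the file names are not listed here.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    # UTF-8 is assumed because the records contain Indic-script text.
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)
```

A caller would iterate with something like `for record in read_jsonl(Path("hi/some_split.jsonl"))`, where the path is a placeholder. Reading lazily line by line avoids loading a whole split into memory.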
Each record consists of the following entries:
sentence (string) : The sentence text in the record's language.
facts (List[Dict]) : List of facts associated with the sentence, where each fact is stored as a dictionary.
language (string) : The language ISO code of the record.
Each fact in the facts list is a dictionary with the entries subject, predicate, object, and qualifiers. The qualifiers entry is a list of dictionaries, each with two keys: qualifier_predicate, which represents the property of the qualifier, and qualifier_object, which stores the value for the qualifier's predicate.
Example from the English dataset:
{
  "sentence": "Mark Paul Briers (born 21 April 1968) is a former English cricketer.",
  "facts": [
    {
      "subject": "Mark Briers",
      "predicate": "date of birth",
      "object": "21 April 1968",
      "qualifiers": []
    },
    {
      "subject": "Mark Briers",
      "predicate": "occupation",
      "object": "cricketer",
      "qualifiers": []
    },
    {
      "subject": "Mark Briers",
      "predicate": "country of citizenship",
      "object": "United Kingdom",
      "qualifiers": []
    }
  ],
  "language": "en"
}
Example from one of the low-resource languages (Hindi):
{
  "sentence": "बोरिस पास्तेरनाक १९५८ में साहित्य के क्षेत्र में नोबेल पुरस्कार विजेता रहे हैं।",
  "facts": [
    {
      "subject": "Boris Pasternak",
      "predicate": "nominated for",
      "object": "Nobel Prize in Literature",
      "qualifiers": [
        {
          "qualifier_predicate": "point in time",
          "qualifier_subject": "1958"
        }
      ]
    }
  ],
  "language": "hi"
}
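The record layout shown in the two examples can be walked with a few lines of Python. This is a minimal sketch under stated assumptions: `flatten_facts` and `qualifier_value` are hypothetical helper names, and because the field description names `qualifier_object` while the Hindi sample uses `qualifier_subject`, the sketch accepts either key rather than assuming one is canonical.

```python
import json
from typing import List, Optional, Tuple


def qualifier_value(qualifier: dict) -> Optional[str]:
    """Return the qualifier's value under either key seen in the data."""
    # Assumption: "qualifier_object" (described) and "qualifier_subject"
    # (seen in the Hindi sample) carry the same kind of value.
    return qualifier.get("qualifier_object", qualifier.get("qualifier_subject"))


def flatten_facts(record: dict) -> List[Tuple[str, str, str, list]]:
    """Turn a record's facts into (subject, predicate, object, qualifiers) tuples."""
    rows = []
    for fact in record["facts"]:
        qualifiers = [
            (q["qualifier_predicate"], qualifier_value(q))
            for q in fact.get("qualifiers", [])
        ]
        rows.append((fact["subject"], fact["predicate"], fact["object"], qualifiers))
    return rows


# The facts of the Hindi example, written as a single JSON line.
line = (
    '{"facts": [{"subject": "Boris Pasternak", "predicate": "nominated for", '
    '"object": "Nobel Prize in Literature", "qualifiers": '
    '[{"qualifier_predicate": "point in time", "qualifier_subject": "1958"}]}], '
    '"language": "hi"}'
)
record = json.loads(line)
print(flatten_facts(record))
```

Flattening to tuples keeps each qualifier attached to its fact, which is convenient when building the input side of a fact-to-text training pair.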