It is an extensive, high-quality cross-lingual fact-to-text dataset covering 11 languages: Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Oriya (or), Punjabi (pa), Tamil (ta), and Telugu (te), together with a monolingual dataset in English (en). It is the Wikipedia text <--> Wikidata KG aligned corpus used to train the data-to-text generation model. The train and validation splits are created using distant supervision methods, and the test data is generated through human annotation.

Data Format

The dataset is publicly available here. Each directory holds the dataset for one language (named by its ISO language code) and contains three files:

  • train.jsonl
  • test.jsonl
  • val.jsonl

The data in these files is stored in JSON Lines (jsonl) format.
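As a minimal sketch of reading one split, assuming the dataset has been downloaded to a local folder laid out as `<data_dir>/<language code>/<split>.jsonl`: the names `parse_jsonl`, `read_split`, and `data_dir` are illustrative and not part of the dataset's own tooling.

```python
import json
import os
from typing import Any, Dict, Iterable, Iterator, List


def parse_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)


def read_split(data_dir: str, lang: str, split: str) -> List[Dict[str, Any]]:
    """Load <data_dir>/<lang>/<split>.jsonl (assumed local layout)."""
    path = os.path.join(data_dir, lang, f"{split}.jsonl")
    # UTF-8 is required so that native-script sentences decode correctly.
    with open(path, encoding="utf-8") as handle:
        return list(parse_jsonl(handle))
```

For example, `read_split("data", "hi", "train")` would return the Hindi training records as a list of dictionaries, provided that path exists locally.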

Record structure (JSON structure)

Each record consists of the following entries:

  • sentence (string) : Native language Wikipedia sentence (non-native language strings were removed).
  • facts (List[Dict]) : List of facts associated with the sentence, where each fact is stored as a dictionary.
  • language (string) : Language identifier.

The facts key contains a list of facts, where each fact is stored as a dictionary. A single record within the fact list contains the following entries:

  • subject (string) : central entity.
  • object (string) : entity or a piece of information about the subject.
  • predicate (string) : relationship that connects the subject and the object.
  • qualifiers (List[Dict]) : Additional information about the fact, stored as a list of qualifiers where each record is a dictionary. The dictionary contains two keys: qualifier_predicate to represent the property of the qualifier and qualifier_object to store the value for the qualifier's predicate.
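The record structure above can be sketched as typed Python objects. This is an illustrative mapping, not an official schema: the class names and `parse_record` are hypothetical. The field description names the qualifier value key `qualifier_object`, while the Hindi example in this document uses `qualifier_subject`, so the sketch accepts either as a precaution.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Qualifier:
    predicate: str  # from "qualifier_predicate"
    value: str      # from "qualifier_object" (or "qualifier_subject")


@dataclass
class Fact:
    subject: str
    predicate: str
    object: str
    qualifiers: List[Qualifier] = field(default_factory=list)


@dataclass
class Record:
    sentence: str
    language: str
    facts: List[Fact]


def parse_record(raw: Dict[str, Any]) -> Record:
    """Convert one decoded JSON line into typed objects."""
    facts = []
    for f in raw["facts"]:
        qualifiers = [
            Qualifier(
                predicate=q["qualifier_predicate"],
                # Assumption: fall back to the key used in the Hindi example.
                value=q.get("qualifier_object", q.get("qualifier_subject", "")),
            )
            for q in f.get("qualifiers", [])
        ]
        facts.append(Fact(f["subject"], f["predicate"], f["object"], qualifiers))
    return Record(raw["sentence"], raw["language"], facts)
```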

Examples

Example from English dataset

{
  "sentence": "Mark Paul Briers (born 21 April 1968) is a former English cricketer.",
  "facts": [
    {
      "subject": "Mark Briers",
      "predicate": "date of birth",
      "object": "21 April 1968",
      "qualifiers": []
    },
    {
      "subject": "Mark Briers",
      "predicate": "occupation",
      "object": "cricketer",
      "qualifiers": []
    },
    {
      "subject": "Mark Briers",
      "predicate": "country of citizenship",
      "object": "United Kingdom",
      "qualifiers": []
    }
  ],
  "language": "en"
}

Example from one of the low-resource languages (Hindi)

{
  "sentence": "बोरिस पास्तेरनाक १९५८ में साहित्य के क्षेत्र में नोबेल पुरस्कार विजेता रहे हैं।",
  "facts": [
    {
      "subject": "Boris Pasternak",
      "predicate": "nominated for",
      "object": "Nobel Prize in Literature",
      "qualifiers": [
        {
          "qualifier_predicate": "point in time",
          "qualifier_subject": "1958"
        }
      ]
    }
  ],
  "language": "hi"
}
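To feed records like these to a data-to-text model, the facts are typically flattened into a single input string. The sketch below shows one possible linearization. The marker tokens `<S>`, `<P>`, `<O>`, `<QP>`, and `<QO>` and the function names are assumptions made for illustration, not necessarily the format used to train the original model.

```python
from typing import Any, Dict


def linearize_fact(fact: Dict[str, Any]) -> str:
    """Flatten one fact dictionary into a tagged string."""
    parts = [
        f"<S> {fact['subject']}",
        f"<P> {fact['predicate']}",
        f"<O> {fact['object']}",
    ]
    for q in fact.get("qualifiers", []):
        # Assumption: the value key may be "qualifier_object" or,
        # as in the Hindi example, "qualifier_subject".
        value = q.get("qualifier_object", q.get("qualifier_subject", ""))
        parts.append(f"<QP> {q['qualifier_predicate']} <QO> {value}")
    return " ".join(parts)


def linearize_record(record: Dict[str, Any]) -> str:
    """Join all facts of a record into one model input string."""
    return " ".join(linearize_fact(f) for f in record["facts"])
```

The target side of each training pair would then be the record's `sentence` field, with `language` indicating which language the model should generate.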
