XAlign Dataset | Papers With Code

XAlign is an extensive, high-quality cross-lingual fact-to-text dataset covering 11 languages: Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Oriya (or), Punjabi (pa), Tamil (ta), and Telugu (te), plus a monolingual dataset in English (en). It is a corpus of Wikipedia text aligned with Wikidata knowledge graph (KG) facts, used to train data-to-text generation models. The train and validation splits were created using distant supervision methods, and the test data was produced through human annotation.

## Data Format
The dataset is publicly available [here](https://github.com/tushar117/XAlign). Each directory holds the data for one language (the directory is named by the language's ISO code) and contains three files:

- train.jsonl
- test.jsonl
- val.jsonl

The data in these files is stored in JSON Lines (jsonl) format, with one record per line.
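
The layout described above can be read with a few lines of standard-library Python. This is a minimal sketch, assuming the repository has been downloaded locally and that each language directory sits directly under a root folder. The helper names `iter_records` and `load_split` and the `root` path are hypothetical and are not part of the dataset's own tooling.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, List


def iter_records(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


def load_split(root: Path, lang: str, split: str) -> List[Dict[str, Any]]:
    """Load <root>/<lang>/<split>.jsonl (assumed layout), e.g. root/hi/train.jsonl."""
    return list(iter_records(root / lang / f"{split}.jsonl"))
```

Reading with an explicit `utf-8` encoding matters here because most of the sentences are in non-Latin scripts. For large training files, iterating with `iter_records` avoids holding the whole split in memory.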

### Record structure (JSON structure)
Each record consists of the following entries:

- `sentence` (string) : Native-language Wikipedia sentence (non-native-language strings were removed).
- `facts` (List[Dict]) : List of facts associated with the sentence, where each fact is stored as a dictionary.
- `language` (string) : Language identifier.

The `facts` key contains a list of facts, each stored as a dictionary. A single fact in this list contains the following entries:

- `subject` (string) : Central entity.
- `object` (string) : Entity or piece of information about the subject.
- `predicate` (string) : Relationship that connects the subject and the object.
- `qualifiers` (List[Dict]) : Additional information about the fact, stored as a list of qualifiers where each qualifier is a dictionary. The dictionary contains two keys: `qualifier_predicate`, the property of the qualifier, and `qualifier_object`, the value for the qualifier's predicate.
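
The record structure above can be mirrored with typed containers, which makes missing keys fail early. This is a sketch only: the class names `Qualifier`, `Fact`, and `Record` and the function `parse_record` are hypothetical. Because the description names `qualifier_object` while the Hindi example on this page uses `qualifier_subject`, the sketch accepts either key as an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Qualifier:
    predicate: str
    value: str


@dataclass
class Fact:
    subject: str
    predicate: str
    object: str
    qualifiers: List[Qualifier] = field(default_factory=list)


@dataclass
class Record:
    sentence: str
    language: str
    facts: List[Fact]


def parse_qualifier(raw: Dict[str, Any]) -> Qualifier:
    # Assumption: the value may live under either key (see the lead-in).
    value = raw.get("qualifier_object", raw.get("qualifier_subject", ""))
    return Qualifier(predicate=raw["qualifier_predicate"], value=value)


def parse_record(raw: Dict[str, Any]) -> Record:
    """Convert one decoded JSON object into a Record, raising KeyError on missing keys."""
    facts = [
        Fact(
            subject=f["subject"],
            predicate=f["predicate"],
            object=f["object"],
            qualifiers=[parse_qualifier(q) for q in f.get("qualifiers", [])],
        )
        for f in raw["facts"]
    ]
    return Record(sentence=raw["sentence"], language=raw["language"], facts=facts)
```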

### Examples
Example from the English dataset:
```
{
  "sentence": "Mark Paul Briers (born 21 April 1968) is a former English cricketer.",
  "facts": [
    {
      "subject": "Mark Briers",
      "predicate": "date of birth",
      "object": "21 April 1968",
      "qualifiers": []
    },
    {
      "subject": "Mark Briers",
      "predicate": "occupation",
      "object": "cricketer",
      "qualifiers": []
    },
    {
      "subject": "Mark Briers",
      "predicate": "country of citizenship",
      "object": "United Kingdom",
      "qualifiers": []
    }
  ],
  "language": "en"
}
```
Example from one of the low-resource languages (Hindi):
```
{
  "sentence": "बोरिस पास्तेरनाक १९५८ में साहित्य के क्षेत्र में नोबेल पुरस्कार विजेता रहे हैं।",
  "facts": [
    {
      "subject": "Boris Pasternak",
      "predicate": "nominated for",
      "object": "Nobel Prize in Literature",
      "qualifiers": [
        {
          "qualifier_predicate": "point in time",
          "qualifier_subject": "1958"
        }
      ]
    }
  ],
  "language": "hi"
}
```
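
Since the corpus is meant for data-to-text generation, the facts of a record usually have to be flattened into a single input string. The sketch below shows one possible way to do that. The function name `linearize_facts` and the `<S>`, `<P>`, `<O>`, `<QP>`, `<QO>` marker tokens are hypothetical choices for illustration, not the format used by the dataset authors. As an assumption, it reads the qualifier value from either `qualifier_object` or `qualifier_subject`, because the description and the Hindi example above use different key names.

```python
from typing import Any, Dict, List


def linearize_facts(facts: List[Dict[str, Any]]) -> str:
    """Flatten a list of fact dictionaries into one marker-delimited string."""
    parts = []
    for fact in facts:
        chunk = f"<S> {fact['subject']} <P> {fact['predicate']} <O> {fact['object']}"
        for qualifier in fact.get("qualifiers", []):
            # Assumption: accept whichever of the two value keys is present.
            value = qualifier.get(
                "qualifier_object", qualifier.get("qualifier_subject", "")
            )
            chunk += f" <QP> {qualifier['qualifier_predicate']} <QO> {value}"
        parts.append(chunk)
    return " ".join(parts)


hindi_facts = [
    {
        "subject": "Boris Pasternak",
        "predicate": "nominated for",
        "object": "Nobel Prize in Literature",
        "qualifiers": [
            {"qualifier_predicate": "point in time", "qualifier_subject": "1958"}
        ],
    }
]
print(linearize_facts(hindi_facts))
# <S> Boris Pasternak <P> nominated for <O> Nobel Prize in Literature <QP> point in time <QO> 1958
```

The target side of a training pair would then be the record's `sentence` field, with `language` available to select or tag the output language.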

## Similar Datasets

- T-REx
- WikiBio
- KELM
- GenWiki
