We provide a new dataset, XWikiRef, for the task of cross-lingual multi-document summarization. This task aims at generating Wikipedia-style text in low-resource languages by taking reference text as input. Overall, the dataset covers 8 languages: Bengali (bn), English (en), Hindi (hi), Marathi (mr), Malayalam (ml), Odia (or), Punjabi (pa) and Tamil (ta). It also covers 5 domains: books, films, politicians, sportsmen and writers.

Data Format

The dataset is publicly available here. Each directory contains a language-specific subset of the data, with one JSON file per domain. In each file, each line denotes one article and contains the following keys:

  • Article title
  • Sections
    • section title 1
    • section text 1
    • list of reference texts 1
    • ...
    • section title n
    • section text n
    • list of reference texts n
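
As a minimal sketch of how such a file might be read, the snippet below parses one article per line and walks its sections. The key names (`title`, `sections`, `content`, `references`) and the assumption that the sections are stored as a list of objects are hypothetical, since the exact schema is not spelled out above; check them against the actual files before use.

```python
import json
from typing import Iterator

# Hypothetical key names: verify them against the actual files.
TITLE_KEY = "title"
SECTIONS_KEY = "sections"
SECTION_TITLE_KEY = "title"
SECTION_TEXT_KEY = "content"
REFERENCES_KEY = "references"


def iter_articles(path: str) -> Iterator[dict]:
    """Yield one article dict per non-empty line of a per-domain file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def iter_examples(article: dict) -> Iterator[tuple]:
    """Yield (article title, section title, reference texts, section text).

    Assumes the sections are stored as a list of dicts.
    """
    for section in article.get(SECTIONS_KEY, []):
        yield (
            article.get(TITLE_KEY, ""),
            section.get(SECTION_TITLE_KEY, ""),
            section.get(REFERENCES_KEY, []),
            section.get(SECTION_TEXT_KEY, ""),
        )
```

Each yielded tuple pairs the inputs of the summarization task (the titles and the reference texts) with its target (the section text). Reading line by line avoids loading a whole domain file into memory.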




