This is a dataset for multi-document summarization in Portuguese, which means that each example pairs multiple documents (input) with a human-written summary (output). In particular, each entry contains multiple related texts from Brazilian websites about a subject, and the summary is the lead section of the Portuguese Wikipedia article on the same subject (the lead is the first section of a Wikipedia article and serves as its summary). Input texts were extracted from the BrWac corpus, and the outputs from the Portuguese Wikipedia dumps.

BrWac2Wiki contains 114,652 examples of (documents, wikipedia) pairs, so it is suitable for training and validating AI models for multi-document summarization in Portuguese. More information is available in the paper "PLSUM: Generating PT-BR Wikipedia by Summarizing Websites", by André Seidel Oliveira and Anna Helena Reali Costa, to be presented at ENIAC 2021. Our work is inspired by WikiSum, a similar dataset for the English language.
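
As a rough illustration of how a (documents, wikipedia) pair might be prepared for a summarization model, the sketch below joins the documents into a single truncated input string and holds out part of the data for validation. The field names (`title`, `documents`, `summary`), the `" </s> "` separator, and the word budget are assumptions made for this example, not the format of the released files.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Example:
    # Hypothetical field names; the released files may use different keys.
    title: str            # subject of the Wikipedia article
    documents: List[str]  # related website texts (input)
    summary: str          # Wikipedia lead section (target)


def build_source(example: Example, max_words: int = 512, sep: str = " </s> ") -> str:
    """Join the title and documents into one string, truncated to max_words words."""
    joined = sep.join([example.title] + example.documents)
    words = joined.split()
    return " ".join(words[:max_words])


def split_examples(
    examples: Sequence[Example], valid_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Shuffle deterministically and return (train, validation) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]
```

Truncating by whitespace-separated words is only a stand-in for a model tokenizer's length limit; a real pipeline would likely rank or filter the documents before truncating.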

License


  • Unknown