DiscoFuse

Introduced by Geva et al. in DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion

DiscoFuse was created by applying a rule-based splitting method on two corpora - sports articles crawled from the Web, and Wikipedia. See the paper for a detailed description of the dataset generation process and evaluation.

DiscoFuse has two parts with 44,177,443 and 16,642,323 examples sourced from Sports articles and Wikipedia, respectively.

For each part, a random split is provided to train (98% of the examples), development (1%) and test (1%) sets. In addition, as the original data distribution is highly skewed (see details in the paper), a balanced version for each part is also provided.

Source: Google Research

Homepage