DiscoFuse was created by applying a rule-based splitting method to two corpora: sports articles crawled from the Web, and Wikipedia. See the paper for a detailed description of the dataset generation process and evaluation.
DiscoFuse has two parts, with 44,177,443 examples sourced from sports articles and 16,642,323 from Wikipedia.
Each part is randomly split into train (98% of the examples), development (1%) and test (1%) sets. In addition, because the original data distribution is highly skewed (see details in the paper), a balanced version of each part is also provided.
Source: Google Research
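As a rough illustration of the 98% / 1% / 1% split described above, the sketch below estimates how many examples each set would hold for the two parts. The function name, the dictionary keys, and the rounding rule (integer division, with the remainder assigned to train) are assumptions made for this example; the sizes of the released files may differ slightly.

```python
# Example counts per part, taken from the dataset description.
# The keys "sports" and "wikipedia" are illustrative labels only.
PART_SIZES = {"sports": 44_177_443, "wikipedia": 16_642_323}


def approximate_split_sizes(total):
    """Estimate train/dev/test sizes for a 98/1/1 split.

    Assumption: dev and test each get floor(1%) of the examples and
    train gets whatever remains. The real split may round differently.
    """
    dev = total // 100
    test = total // 100
    train = total - dev - test
    return {"train": train, "dev": dev, "test": test}


if __name__ == "__main__":
    for part, total in PART_SIZES.items():
        print(part, approximate_split_sizes(total))
```

These figures are estimates for planning storage and training time, not official counts.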