The Newsela dataset was introduced by Xu et al. in their research on text simplification. It is a corpus that includes thousands of news articles professionally leveled to different reading complexities. The dataset is used for academic research in fields such as text difficulty and text simplification. It is made available to academic partners upon request. The dataset is often used as a benchmark in the field of text simplification. Please note that the Newsela dataset is different from the NELA datasets, which are collections of news articles for the study of media bias and other applications.
104 PAPERS • 1 BENCHMARK
WikiLarge comprise 359 test sentences, 2000 development sentences and 300k training sentences. Each source sentences in test set has 8 simplified references
65 PAPERS • NO BENCHMARKS YET
TurkCorpus, a dataset with 2,359 original sentences from English Wikipedia, each with 8 manual reference simplifications. The dataset is divided into two subsets: 2,000 sentences for validation and 359 for testing of sentence simplification models.
43 PAPERS • 1 BENCHMARK
CEFR-SP contains 17k English sentences annotated with the levels based on the Common European Framework of Reference for Languages assigned by English-education professionals.
3 PAPERS • NO BENCHMARKS YET
This dataset contains around 5000 scholarly articles and their corresponding easy summary from eureka alert blog, the dataset can be used for the combined task of summarization and simplification.
3 PAPERS • 1 BENCHMARK
Med-EASi (Medical dataset for Elaborative and Abstractive Simplification), a uniquely crowdsourced and finely annotated dataset for supervised simplification of short medical texts. It contains 1979 expert-simple text pairs in medical domain, spanning a total of 4478 UMLS concepts across all text pairs. The dataset is annotated with four textual transformations: replacement, elaboration, insertion and deletion.
TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.
2 PAPERS • NO BENCHMARKS YET
The goal of InfoLossQA is to generate a series of QA pairs that reveal to lay readers what information a simplified text lacks compared to its original.
1 PAPER • NO BENCHMARKS YET
A medical Wiki paralell corpus for medical text simplification.
SimpEvalASSET is a dataset for learning learnable metrics using modern language models. It comprises of 12K human ratings on 2.4K simplifications of 24 systems, and SIMPEVAL_2022, a challenging simplification benchmark consisting of over 1K human ratings of 360 simplifications including generations from GPT-3.5.