CNN/Daily Mail is a dataset for text summarization. Human-generated abstractive summary bullets from news stories on the CNN and Daily Mail websites were turned into questions (with one of the entities hidden), and the stories serve as the corresponding passages from which the system is expected to answer the fill-in-the-blank question. The authors released the scripts that crawl, extract and generate the pairs of passages and questions from these websites.
436 PAPERS • 12 BENCHMARKS
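For readers who just want a preprocessed copy rather than running the crawling scripts, a minimal loading sketch with the Hugging Face `datasets` library is shown below; the "cnn_dailymail" identifier and the "3.0.0" configuration refer to a community mirror and are assumptions, not part of the original release.

```python
# Minimal sketch: load a preprocessed CNN/Daily Mail mirror and inspect one pair.
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")          # assumed mirror id and config
example = cnn_dm["train"][0]
print(example["article"][:300])                          # source news story
print(example["highlights"])                             # human-written summary bullets
print({split: len(ds) for split, ds in cnn_dm.items()})  # split sizes
```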
WikiHow is a dataset of more than 230,000 article-summary pairs extracted and constructed from an online knowledge base written by different human authors. The articles span a wide range of topics and exhibit a high diversity of writing styles.
97 PAPERS • 2 BENCHMARKS
Multi-News consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited.
87 PAPERS • 6 BENCHMARKS
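Because each Multi-News input is a cluster of several articles, a common preprocessing step is to split the concatenated source back into individual documents. The sketch below assumes the Hugging Face "multi_news" mirror, where the source articles of a cluster are joined into a single "document" field with a "|||||" separator.

```python
# Sketch: recover the individual source articles of one Multi-News cluster.
from datasets import load_dataset

multi_news = load_dataset("multi_news", split="validation")   # assumed mirror id
sample = multi_news[0]
articles = [a.strip() for a in sample["document"].split("|||||") if a.strip()]
print(f"{len(articles)} source articles in this cluster")
print(sample["summary"][:300])   # editor-written summary of the cluster
```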
KP20k is a large-scale scholarly articles dataset with 528K articles for training, 20K articles for validation and 20K articles for testing.
77 PAPERS • 3 BENCHMARKS
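KP20k is commonly used for keyphrase prediction, where systems are scored with exact-match F1 over the top-k predicted phrases. The snippet below is an illustrative sketch of that metric under simple lowercasing and whitespace normalization; stemming, which some papers apply, is omitted.

```python
# Illustrative exact-match F1@k for keyphrase prediction (KP20k-style evaluation).
def f1_at_k(predicted: list[str], gold: list[str], k: int = 5) -> float:
    norm = lambda s: " ".join(s.lower().split())
    pred = [norm(p) for p in predicted][:k]
    gold_set = {norm(g) for g in gold}
    tp = sum(p in gold_set for p in pred)        # exact matches among top-k predictions
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)

print(f1_at_k(["neural networks", "keyphrase extraction", "attention"],
              ["keyphrase extraction", "neural networks", "sequence models"]))  # ~0.67
```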
The How2 dataset contains 13,500 videos, or 300 hours of speech, and is split into 185,187 training, 2022 development (dev), and 2361 test utterances. It has subtitles in English and crowdsourced Portuguese translations.
69 PAPERS • 2 BENCHMARKS
Sentence Compression is a dataset in which the syntactic trees of the compressions are subtrees of their uncompressed counterparts, so supervised systems that require a structural alignment between the input and output can be successfully trained on it.
59 PAPERS • 1 BENCHMARK
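Since the compressions are deletion-based, each pair can be converted into a token-level keep/delete labeling for supervised training. The sketch below illustrates this with naive whitespace tokenization and greedy matching, both simplifying assumptions.

```python
# Sketch: derive binary keep/delete labels from a deletion-based compression pair.
def keep_delete_labels(source: str, compression: str) -> list[int] | None:
    src, comp = source.split(), compression.split()   # naive whitespace tokenization
    labels, j = [], 0
    for tok in src:
        if j < len(comp) and tok == comp[j]:
            labels.append(1)   # keep
            j += 1
        else:
            labels.append(0)   # delete
    return labels if j == len(comp) else None         # None: not a pure deletion pair

print(keep_delete_labels(
    "The company said on Tuesday that it will cut 100 jobs",
    "The company will cut 100 jobs"))                 # [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```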
The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007, with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com.
51 PAPERS • 1 BENCHMARK
WikiLingua includes ~770k article and summary pairs in 18 languages from WikiHow. Gold-standard article-summary alignments across languages are extracted by aligning the images that are used to describe each how-to step in an article.
42 PAPERS • 5 BENCHMARKS
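The alignment idea can be pictured as pairing how-to steps across language editions whenever they share the same illustration. The sketch below is purely illustrative; the dictionary structures and image identifiers are hypothetical, not the released data format.

```python
# Illustrative sketch of image-based step alignment across two language editions.
def align_steps(steps_en: dict[str, str], steps_es: dict[str, str]) -> list[tuple[str, str]]:
    """Each dict maps a step-illustration identifier to that step's summary sentence."""
    shared = steps_en.keys() & steps_es.keys()        # steps sharing the same image
    return [(steps_en[img], steps_es[img]) for img in sorted(shared)]

en = {"img_001.jpg": "Chop the onions finely.", "img_002.jpg": "Heat the oil."}
es = {"img_001.jpg": "Pica finamente las cebollas.", "img_003.jpg": "Sirve caliente."}
print(align_steps(en, es))   # only the step sharing img_001.jpg is aligned
```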
GovReport is a dataset for long-document summarization, with documents and summaries significantly longer than those in most existing datasets. It consists of reports written by U.S. government research agencies, including the Congressional Research Service and the Government Accountability Office.
39 PAPERS • 2 BENCHMARKS
MLSUM is a large-scale multilingual summarization dataset. Obtained from online newspapers, it contains over 1.5 million article/summary pairs in five languages: French, German, Spanish, Russian and Turkish. Together with English articles from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual corpus that enables new research directions for the text summarization community.
37 PAPERS • 7 BENCHMARKS
QMSum is a human-annotated benchmark for query-based meeting summarization, consisting of 1,808 query-summary pairs over 232 meetings across multiple domains.
36 PAPERS • 1 BENCHMARK
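Each QMSum meeting is paired with several queries and their query-focused summaries, so training data is usually flattened into (query, transcript, summary) triples. The field names in the sketch below ("meeting_transcripts", "general_query_list", "specific_query_list", "speaker", "content", "query", "answer") reflect one reading of the released JSON and should be treated as assumptions to verify against the actual files.

```python
# Hedged sketch: flatten one QMSum meeting record into (query, source, summary) triples.
import json

def to_triples(meeting: dict) -> list[tuple[str, str, str]]:
    transcript = " ".join(
        f"{turn['speaker']}: {turn['content']}"
        for turn in meeting["meeting_transcripts"]        # assumed field name
    )
    queries = meeting.get("general_query_list", []) + meeting.get("specific_query_list", [])
    return [(q["query"], transcript, q["answer"]) for q in queries]

with open("qmsum_meeting.json") as f:                     # hypothetical local file
    print(len(to_triples(json.load(f))), "query-summary pairs in this meeting")
```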
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics.
33 PAPERS • 2 BENCHMARKS
XL-Sum is a comprehensive and diverse dataset for abstractive summarization comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
25 PAPERS • NO BENCHMARKS YET
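A per-language subset can be pulled from the public mirror as sketched below; the "csebuetnlp/xlsum" identifier, the per-language config names, and the field names are assumptions about that mirror rather than part of the paper.

```python
# Sketch: load the English subset of XL-Sum from an assumed Hugging Face mirror.
from datasets import load_dataset

xlsum_en = load_dataset("csebuetnlp/xlsum", "english", split="validation")
sample = xlsum_en[0]
print(sample["title"])
print(sample["text"][:300])    # full BBC article
print(sample["summary"])       # professionally written single-paragraph summary
```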
The Extreme Summarization (XSum) dataset is a dataset for the evaluation of abstractive single-document summarization systems. The goal is to create a short, one-sentence news summary answering the question “What is the article about?”. The dataset consists of 226,711 news articles, each accompanied by a one-sentence summary. The articles are collected from BBC articles (2010 to 2017) and cover a wide variety of domains (e.g., News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts). The official random split contains 204,045 (90%), 11,332 (5%) and 11,334 (5%) documents in the training, validation and test sets, respectively.
23 PAPERS • 4 BENCHMARKS
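The quoted split sizes can be sanity-checked directly, and a preprocessed copy is available on the Hugging Face hub; the "EdinburghNLP/xsum" identifier and the "document"/"summary" field names are assumptions about that mirror.

```python
# Sanity-check the official 90/5/5 split sizes quoted above, then load an assumed mirror.
from datasets import load_dataset

splits = {"train": 204_045, "validation": 11_332, "test": 11_334}
total = sum(splits.values())                                            # 226,711 articles
print({name: round(n / total * 100, 1) for name, n in splits.items()})  # ~90/5/5

xsum = load_dataset("EdinburghNLP/xsum", split="validation")
print(xsum[0]["document"][:200], "->", xsum[0]["summary"])
```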
BookSum is a collection of datasets for long-form narrative summarization. It covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human-written summaries at three levels of granularity of increasing difficulty: paragraph, chapter and book level. The domain and structure of this dataset pose a unique set of challenges for summarization systems, including processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures.
21 PAPERS • 1 BENCHMARK
SCROLLS (Standardized CompaRison Over Long Language Sequences) is an NLP benchmark consisting of a suite of tasks that require reasoning over long texts. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. The dataset is made available in a unified text-to-text format, and the authors host a live leaderboard to facilitate research on model architectures and pretraining methods.
19 PAPERS • 1 BENCHMARK
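The unified text-to-text format means every SCROLLS task reduces to an "input" string and an "output" string, as sketched below; the "tau/scrolls" identifier, the "gov_report" config name, and the field names are assumptions about the public release.

```python
# Sketch: inspect SCROLLS' unified text-to-text format for one task (GovReport).
from datasets import load_dataset

gov_report = load_dataset("tau/scrolls", "gov_report", split="validation")
ex = gov_report[0]
print(len(ex["input"].split()), "whitespace tokens in the input document")
print(ex["output"][:300])      # target summary text
```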
The DUC2004 dataset is a dataset for document summarization, designed and used for testing only. It consists of 500 news articles, each paired with four human-written summaries. Specifically, it consists of 50 clusters of Text REtrieval Conference (TREC) documents drawn from the following collections: AP newswire (1998-2000), New York Times newswire (1998-2000), and Xinhua News Agency, English version (1996-2000). Each cluster contains on average 10 documents.
15 PAPERS • 4 BENCHMARKS
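Since each DUC2004 item comes with four human-written reference summaries, system output is usually scored against all of them; taking the maximum ROUGE over references is one common convention (some evaluations average instead). A hedged sketch using the `rouge-score` package:

```python
# Hedged sketch: multi-reference ROUGE for DUC2004-style evaluation (max over references).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def multi_ref_rouge(system_summary: str, references: list[str]) -> dict[str, float]:
    best: dict[str, float] = {}
    for ref in references:
        scores = scorer.score(ref, system_summary)   # (target, prediction) argument order
        for name, score in scores.items():
            best[name] = max(best.get(name, 0.0), score.fmeasure)
    return best

refs = ["Placeholder reference summary."] * 4        # DUC2004 provides four per item
print(multi_ref_rouge("A hypothetical system summary for one cluster.", refs))
```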
This is a dataset for evaluating summarisation methods for research papers.
10 PAPERS • 3 BENCHMARKS
OpenAssistant Conversations is a human-feedback dataset containing between 100K and 1M examples, released under the Apache 2.0 license.
8 PAPERS • NO BENCHMARKS YET
ConvoSumm is a suite of four datasets to evaluate a model’s performance on a broad spectrum of conversation data.
6 PAPERS • NO BENCHMARKS YET
SSN (short for Semantic Scholar Network) is a scientific papers summarization dataset which contains 141K research papers in different domains and 661K citation relationships. The entire dataset constitutes a large connected citation graph.
5 PAPERS • NO BENCHMARKS YET
EUR-Lex-Sum is a dataset for cross-lingual summarization. It is based on manually curated document summaries of legal acts from the European Union law platform. Documents and their respective summaries exist as cross-lingual paragraph-aligned data in several of the 24 official European languages, enabling access to various cross-lingual and lower-resourced summarization setups. The dataset contains up to 1,500 document/summary pairs per language, including a subset of 375 cross-lingually aligned legal acts with texts available in all 24 languages.
4 PAPERS • NO BENCHMARKS YET
WikiHowQA is a Community-based Question Answering dataset, which can be used for both answer selection and abstractive summarization tasks. It contains 76,687 questions in the train set, 8,000 in the development set and 22,354 in the test set.
This dataset contains 1,507 domain-expert-annotated consumer health questions and corresponding summaries. It is derived from a community question-answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media.
3 PAPERS • NO BENCHMARKS YET
There is currently much interest and activity aimed at building powerful multi-purpose information systems. The agencies involved include DARPA, ARDA and NIST. Their programmes cover a range of subprogrammes; examples include DARPA's TIDES (Translingual Information Detection Extraction and Summarization) programme, ARDA's Advanced Question & Answering Program, and NIST's TREC (Text REtrieval Conference) programme. These focus on different tasks requiring their own evaluation designs.
2 PAPERS • NO BENCHMARKS YET
Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can greatly benefit from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable the creation of such NLP systems, the authors introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HumSet provides selected snippets (entries) as well as classes assigned to each entry, annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks, and the accompanying paper takes a first step towards these tasks with a set of experiments.
PubMedCite is a domain-specific dataset with about 192K biomedical scientific papers and a large citation graph preserving 917K citation relationships between them. It is characterized by preserving the salient contents extracted from the full texts of references, and the weighted correlation between these salient contents.
1 PAPER • NO BENCHMARKS YET
An open corpus of scientific research papers with a representative sample from across scientific disciplines. The corpus includes not only the full text of each article, but also the document metadata, along with the bibliographic information for each reference.
The "Famous Keyword Twitter Replies Dataset" is a comprehensive collection of Twitter data that focuses on popular keywords and their associated replies. This dataset contains five essential columns that provide valuable insights into the Twitter conversation dynamics:
Mental health remains a significant public health challenge worldwide. With the increasing popularity of online platforms, many people use them to share their mental health conditions, express their feelings, and seek help from the community and counselors. While posts vary in length, a short but informative summary helps counselors process them quickly. To facilitate research on summarization of mental health posts, the authors introduce the Mental Health Summarization dataset, MentSum, containing over 24k carefully selected user posts from Reddit, along with their short user-written summaries (called TLDR) in English, drawn from 43 mental health subreddits.
1 PAPER • 1 BENCHMARK
OASum is a large-scale open-domain aspect-based summarization dataset which contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages.
PMC-SA (PMC Structured Abstracts) is a dataset of academic publications, used for the task of structured summarization.
This is a large-scale court judgment dataset, in which each judgment is a summary of the case description written in a patternized style. It contains 2,003,390 court judgment documents. The case description is used as the input and the court judgment as the summary. The average lengths of the input documents and summaries are 595.15 and 273.57 words, respectively.
SubSumE is a dataset for subjective document summarization. See the accompanying paper and talk for details on dataset creation; the authors' related work SuDocu addresses example-based document summarization.
TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.
Wiki-en is an annotated English dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN).
Wikipedia Webpage 2M (WikiWeb2M) is an open-source multimodal dataset consisting of over 2 million English Wikipedia articles, created by re-scraping the ∼2M English articles in WIT. Each webpage sample includes the page URL and title; section titles, text and indices; and images with their captions.