1 PAPER • NO BENCHMARKS YET
OpenAsp is an Open Aspect-based Multi-Document Summarization dataset derived from the DUC and MultiNews summarization datasets.
2 PAPERS • NO BENCHMARKS YET
Inshorts is a news service that offers short summaries, of 60 words or fewer, of news from around the web. This dataset contains headlines and summaries of news items, along with their sources.
1 PAPER • 1 BENCHMARK
Wikipedia Webpage 2M (WikiWeb2M) is a multimodal open source dataset consisting of over 2 million English Wikipedia articles. It is created by rescraping the ∼2M English articles in WIT. Each webpage sample includes the page URL and title, section titles, text, and indices, images and their captions.
OpenAssistant Conversations. License: apache-2.0. Tags: human-feedback. Size category: 100K<n<1M.
14 PAPERS • NO BENCHMARKS YET
PubMedCite is a domain-specific dataset with about 192K biomedical scientific papers and a large citation graph preserving 917K citation relationships between them. It is characterized by preserving the salient contents extracted from the full texts of references, and the weighted correlation between the salient contents.
3 PAPERS • NO BENCHMARKS YET
TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.
OASum is a large-scale open-domain aspect-based summarization dataset which contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages.
EUR-Lex-Sum is a dataset for cross-lingual summarization. It is based on manually curated document summaries of legal acts from the European Union law platform. Documents and their respective summaries exist as cross-lingual, paragraph-aligned data in several of the 24 official European languages, enabling access to various cross-lingual and lower-resourced summarization setups. The dataset contains up to 1,500 document/summary pairs per language, including a subset of 375 cross-lingually aligned legal acts with texts available in all 24 languages.
5 PAPERS • NO BENCHMARKS YET
Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can highly benefit from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable creation of such NLP systems, we introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HumSet provides selected snippets (entries) as well as assigned classes to each entry annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of expe
WikiDes is a dataset for generating descriptions of Wikidata from Wikipedia paragraphs.
This dataset was used in the paper 'Template-based Abstractive Microblog Opinion Summarisation' (to be published at TACL, 2022). The data is structured as follows: each file represents a cluster of tweets which contains the tweet IDs and a summary of the tweets written by journalists. The gold standard summary follows a template structure and depending on its opinion content, it contains a main story, majority opinion (if any) and/or minority opinions (if any).
Contains 1,507 domain-expert annotated consumer health questions and corresponding summaries. The dataset is derived from a community question answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media.
Mental health remains a significant challenge of public health worldwide. With increasing popularity of online platforms, many use the platforms to share their mental health conditions, express their feelings, and seek help from the community and counselors. While posts are of varying length, it is beneficial to provide a short, but informative summary for fast processing by the counselors. To facilitate research in summarization of mental health online posts, we introduce Mental Health Summarization dataset, MentSum, containing over 24k carefully selected user posts from Reddit, along with their short user-written summary (called TLDR) in English from 43 mental health subreddits.
Can language models read biomedical texts and explain the biomedical mechanisms discussed? In this work we introduce a biomedical mechanism summarization task. Biomedical studies often investigate the mechanisms behind how one entity (e.g., a protein or a chemical) affects another in a biological context. The abstracts of these publications often include a focused set of sentences that present relevant supporting statements regarding such relationships, associated experimental evidence, and a concluding sentence that summarizes the mechanism underlying the relationship. We leverage this structure and create a summarization task, where the input is a collection of sentences and the main entities in an abstract, and the output includes the relationship and a sentence that summarizes the mechanism. Using a small amount of manually labeled mechanism sentences, we train a mechanism sentence classifier to filter a large biomedical abstract collection and create a summarization dataset with 2
SCROLLS (Standardized CompaRison Over Long Language Sequences) is an NLP benchmark consisting of a suite of tasks that require reasoning over long texts. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. The dataset is made available in a unified text-to-text format, and a live leaderboard is hosted to facilitate research on model architecture and pretraining methods.
31 PAPERS • 1 BENCHMARK
This repository contains the SubSumE dataset for subjective document summarization. See the paper and the talk for details on dataset creation. Also check out our work SuDocu on example-based document summarization.
XL-Sum is a comprehensive and diverse dataset for abstractive summarization comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
44 PAPERS • NO BENCHMARKS YET
ConvoSumm is a suite of four datasets to evaluate a model’s performance on a broad spectrum of conversation data.
4 PAPERS • NO BENCHMARKS YET
BookSum is a collection of datasets for long-form narrative summarization. This dataset covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human-written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of this dataset pose a unique set of challenges for summarization systems, which include processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures.
30 PAPERS • 1 BENCHMARK
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics.
38 PAPERS • 2 BENCHMARKS
This dataset was created from a document categorization dataset that consists of 2,225 documents from the BBC news website, corresponding to stories in five topical areas from 2004-2005. The source dataset was used in D. Greene and P. Cunningham, "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006. All rights, including copyright, in the content of the original articles are owned by the BBC. More at http://mlg.ucd.ie/datasets/bbc.html
QMSum is a human-annotated benchmark for the query-based multi-domain meeting summarisation task. It consists of 1,808 query-summary pairs over 232 meetings in multiple domains.
49 PAPERS • 1 BENCHMARK
SSN (short for Semantic Scholar Network) is a scientific papers summarization dataset which contains 141K research papers in different domains and 661K citation relationships. The entire dataset constitutes a large connected citation graph.
GovReport is a dataset for long document summarization, with significantly longer documents and summaries. It consists of reports written by government research agencies, including the Congressional Research Service and the U.S. Government Accountability Office.
53 PAPERS • 2 BENCHMARKS
This dataset contains around 5,000 scholarly articles and their corresponding easy-to-read summaries from the EurekAlert blog. It can be used for the combined task of summarization and simplification.
3 PAPERS • 1 BENCHMARK
WikiHowQA is a Community-based Question Answering dataset, which can be used for both answer selection and abstractive summarization tasks. It contains 76,687 questions in the train set, 8,000 in the development set and 22,354 in the test set.
Multi-News consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited.
103 PAPERS • 4 BENCHMARKS
The How2 dataset contains 13,500 videos, or 300 hours of speech, and is split into 185,187 training, 2,022 development (dev), and 2,361 test utterances. It has subtitles in English and crowdsourced Portuguese translations.
73 PAPERS • 2 BENCHMARKS
WikiHow is a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base written by different human authors. The articles span a wide range of topics and represent a high diversity of styles.
106 PAPERS • 2 BENCHMARKS
The Extreme Summarization (XSum) dataset is a dataset for evaluation of abstractive single-document summarization systems. The goal is to create a short, one-sentence new summary answering the question “What is the article about?”. The dataset consists of 226,711 news articles, each accompanied by a one-sentence summary. The articles are collected from BBC articles (2010 to 2017) and cover a wide variety of domains (e.g., News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts). The official random split contains 204,045 (90%), 11,332 (5%) and 11,334 (5%) documents in training, validation and test sets, respectively.
27 PAPERS • 6 BENCHMARKS
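As a quick sanity check on the XSum split figures quoted above, the following minimal sketch recomputes the total and the percentage share of each split. The dictionary and the helper name `split_shares` are illustrative only and are not part of any XSum tooling.

```python
# Split sizes as quoted in the XSum description (not loaded from the dataset itself).
splits = {"train": 204_045, "validation": 11_332, "test": 11_334}


def split_shares(sizes):
    """Return each split's share of the total, as a percentage rounded to 1 decimal."""
    total = sum(sizes.values())
    return {name: round(100 * count / total, 1) for name, count in sizes.items()}


total = sum(splits.values())
shares = split_shares(splits)
print(total)   # should equal the 226,711 articles stated in the description
print(shares)  # should be close to the stated 90% / 5% / 5%
```

The same helper can be pointed at any other set of split sizes quoted in this listing.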
KP20k is a large-scale scholarly articles dataset with 528K articles for training, 20K articles for validation and 20K articles for testing.
79 PAPERS • 3 BENCHMARKS
The Reddit dataset is a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. 50 large communities have been sampled to build a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). For features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.
593 PAPERS • 13 BENCHMARKS
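The post-to-post graph construction described for the Reddit dataset (connect two posts if the same user comments on both) can be sketched as follows. This is an illustration under assumptions, not the original preprocessing code: the `(user_id, post_id)` input format, the function names, and the toy data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_post_graph(comments):
    """Connect two posts when at least one user commented on both.

    `comments` is an iterable of (user_id, post_id) pairs (assumed format).
    Returns a set of undirected edges stored as sorted (post_a, post_b) tuples.
    """
    posts_by_user = defaultdict(set)
    for user_id, post_id in comments:
        posts_by_user[user_id].add(post_id)

    edges = set()
    for posts in posts_by_user.values():
        # Every pair of posts touched by the same user becomes an edge.
        for post_a, post_b in combinations(sorted(posts), 2):
            edges.add((post_a, post_b))
    return edges


def average_degree(edges, num_posts):
    """Each undirected edge adds one to the degree of two posts."""
    return 2 * len(edges) / num_posts if num_posts else 0.0


# Toy data: u1 links p1-p2, u2 links p2-p3, u3 only commented on p4.
toy_comments = [("u1", "p1"), ("u1", "p2"), ("u2", "p2"), ("u2", "p3"), ("u3", "p4")]
edges = build_post_graph(toy_comments)
print(sorted(edges))
print(average_degree(edges, num_posts=4))
```

Because every active user contributes an edge for each pair of posts they commented on, the edge count grows quadratically with user activity, which is consistent with the high average degree reported for the real graph.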
CNN/Daily Mail is a dataset for text summarization. Human generated abstractive summary bullets were generated from news stories in CNN and Daily Mail websites as questions (with one of the entities hidden), and stories as the corresponding passages from which the system is expected to answer the fill-in the-blank question. The authors released the scripts that crawl, extract and generate pairs of passages and questions from these websites.
466 PAPERS • 10 BENCHMARKS
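To illustrate the fill-in-the-blank construction described for CNN/Daily Mail, the sketch below hides one entity at a time in a summary bullet to produce a question and its answer. It is a simplified illustration, not the authors' released scripts: the function `make_cloze`, the `@placeholder` token, and the example sentence are assumptions, and the entity list is taken as given rather than detected.

```python
import re


def make_cloze(bullet, entities, placeholder="@placeholder"):
    """Build (question, answer) pairs by hiding one entity at a time in a bullet."""
    pairs = []
    for entity in entities:
        pattern = r"\b" + re.escape(entity) + r"\b"
        if re.search(pattern, bullet):
            # A callable replacement avoids backslash handling in the placeholder.
            question = re.sub(pattern, lambda _match: placeholder, bullet)
            pairs.append((question, entity))
    return pairs


# Hypothetical summary bullet and entity list.
bullet = "Acme Corp names Jane Doe as chief executive"
pairs = make_cloze(bullet, ["Acme Corp", "Jane Doe", "John Roe"])
for question, answer in pairs:
    print(question, "->", answer)
```

Entities that do not occur in the bullet (here the made-up "John Roe") produce no question, so each bullet yields at most one question per entity it mentions.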
The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus includes:
265 PAPERS • 8 BENCHMARKS
The DUC2004 dataset is a dataset for document summarization. It is designed and used for testing only. It consists of 500 news articles, each paired with four human-written summaries. Specifically, it consists of 50 clusters of Text REtrieval Conference (TREC) documents, from the following collections: AP newswire, 1998-2000; New York Times newswire, 1998-2000; Xinhua News Agency (English version), 1996-2000. Each cluster contains on average 10 documents.
15 PAPERS • 4 BENCHMARKS
There is currently much interest and activity aimed at building powerful multi-purpose information systems. The agencies involved include DARPA, ARDA and NIST. Their programmes, for example DARPA's TIDES (Translingual Information Detection Extraction and Summarization) programme, ARDA's Advanced Question & Answering Program and NIST's TREC (Text Retrieval Conferences) programme cover a range of subprogrammes. These focus on different tasks requiring their own evaluation designs.
An open corpus of Scientific Research papers which has a representative sample from across scientific disciplines. This corpus not only includes the full text of the article, but also the metadata of the documents, along with the bibliographic information for each reference.
The "Famous Keyword Twitter Replies Dataset" is a comprehensive collection of Twitter data that focuses on popular keywords and their associated replies. This dataset contains five essential columns that provide valuable insights into the Twitter conversation dynamics:
A large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages, namely French, German, Spanish, Russian, and Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset which can enable new research directions for the text summarization community.
40 PAPERS • 5 BENCHMARKS
PMC-SA (PMC Structured Abstracts) is a dataset of academic publications, used for the task of structured summarization.
This is a large-scale court judgment dataset, where each judgment is a summary of the case description with a patternized style. It contains 2,003,390 court judgment documents. The case description is used as the input, and the court judgment as the summary. The average lengths of the input documents and summaries are 595.15 words and 273.57 words respectively.
Sentence Compression is a dataset where the syntactic trees of the compressions are subtrees of their uncompressed counterparts, and hence where supervised systems which require a structural alignment between the input and output can be successfully trained.
61 PAPERS • 1 BENCHMARK
Wiki-en is an annotated English dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN).
WikiLingua includes ~770k article and summary pairs in 18 languages from WikiHow. Gold-standard article-summary alignments across languages are extracted by aligning the images that are used to describe each how-to step in an article.
50 PAPERS • 5 BENCHMARKS
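The image-based alignment described for WikiLingua can be pictured with the minimal sketch below: steps from two language versions of the same how-to article are paired when they use the same image. The record layout (dicts with `image` and `text` keys), the function name, and the sample data are hypothetical and are not taken from the WikiLingua release.

```python
def align_steps_by_image(steps_a, steps_b):
    """Pair steps from two language versions of an article that share an image.

    Each argument is a list of dicts with assumed keys "image" and "text".
    Returns (text_a, text_b) tuples in the order of `steps_a`.
    """
    text_by_image = {step["image"]: step["text"] for step in steps_b}
    return [
        (step["text"], text_by_image[step["image"]])
        for step in steps_a
        if step["image"] in text_by_image
    ]


# Toy example: the second English step has no counterpart in the Spanish version.
english = [
    {"image": "step1.jpg", "text": "Boil the water."},
    {"image": "step2.jpg", "text": "Add the pasta."},
]
spanish = [
    {"image": "step1.jpg", "text": "Hierve el agua."},
    {"image": "step3.jpg", "text": "Sirve el plato."},
]
aligned = align_steps_by_image(english, spanish)
print(aligned)
```

Using the image as the join key means no translation or text similarity is needed to align steps, at the cost of dropping steps whose images differ between language versions.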
This is a dataset for evaluating summarisation methods for research papers.
10 PAPERS • 3 BENCHMARKS