SciCite is a dataset of citation intents that addresses multiple scientific domains and is more than five times larger than ACL-ARC.
34 PAPERS • 3 BENCHMARKS
A SemEval shared task in which participants must extract definitions from free text using a term-definition pair corpus that reflects the complex reality of definitions in natural language.
14 PAPERS • NO BENCHMARKS YET
ACL Anthology Reference Corpus (ACL ARC) is a collection of 10,920 academic papers from the ACL Anthology. ACL ARC is cleaned to remove:
12 PAPERS • 4 BENCHMARKS
CSPubSum is a dataset for summarisation of computer science publications, created by exploiting a large resource of author-provided summaries. Its authors also show straightforward ways of extending it further.
3 PAPERS • NO BENCHMARKS YET
CSAbstruct is a new dataset of annotated computer science abstracts with sentence labels according to their rhetorical roles. The key difference between this dataset and PUBMED-RCT is that PubMed abstracts are written according to a predefined structure, whereas computer science papers are free-form. Therefore, there is more variety in writing styles in CSAbstruct. CSAbstruct is collected from the Semantic Scholar corpus (Ammar et al., 2018). Each sentence is annotated by 5 workers on the Figure-eight platform with one of 5 categories: {BACKGROUND, OBJECTIVE, METHOD, RESULT, OTHER}.
1 PAPER • NO BENCHMARKS YET
A dataset of rounds of the card game "Cards Against Humanity" (CAH) played by human players, derived from the online CAH labs. Each round records the cards presented to the player (a "black" prompt card containing a blank or question, and 10 "white" punchline cards as possible responses), which punchline the player picked, and the accompanying text and metadata.
E2E Refined is a dataset for sentence classification. It consists of 40,560 examples for training, 4,489 for validation, and 4,555 for test. It is a refined version of the well-known MR-to-text E2E dataset in which many deletion, insertion, and substitution errors have been fixed.