Clotho is an audio captioning dataset consisting of 4,981 audio samples, each with five captions (24,905 captions in total). Audio samples are 15 to 30 s long and captions are 8 to 20 words long.
139 PAPERS • 5 BENCHMARKS
KP20k is a large-scale scholarly articles dataset with 528K articles for training, 20K articles for validation and 20K articles for testing.
79 PAPERS • 3 BENCHMARKS
The CELEX database comprises three searchable lexical databases: Dutch, English and German. The lexical data in each database is divided into five categories: orthography, phonology, morphology, syntax (word class) and word frequency.
57 PAPERS • NO BENCHMARKS YET
The Video-based Multimodal Summarization with Multimodal Output (VMSMO) corpus consists of 184,920 document-summary pairs: 180,000 for training, 2,460 for validation and 2,460 for testing. The task is to generate an appropriate textual summary of an article and to choose a proper cover frame from the video accompanying the article.
8 PAPERS • NO BENCHMARKS YET
SkillSpan is a dataset for Skill Extraction (SE), an important and widely studied task that is useful for gaining insights into labor market dynamics. However, datasets and annotation guidelines are lacking: the few available datasets contain crowd-sourced span-level labels or labels from a predefined skill inventory. To address this gap, the authors introduce SkillSpan, a novel SE dataset consisting of 14.5K sentences and over 12.5K annotated spans.
7 PAPERS • NO BENCHMARKS YET
This is a dataset for segmentation and classification of epistemic activities in diagnostic reasoning texts.
3 PAPERS • NO BENCHMARKS YET
Hugging Face Datasets is a great library, but it lacks standardization, and datasets require preprocessing before they can be used interchangeably. tasksource automates this preprocessing and facilitates reproducible multi-task learning at scale.
CQR is an extension to the Stanford Dialogue Corpus. It contains crowd-sourced rewrites to facilitate research in dialogue state tracking using natural language as the interface.
2 PAPERS • NO BENCHMARKS YET
ExHVV is a novel dataset that offers natural language explanations of connotative roles for three types of entities (heroes, villains, and victims), encompassing 4,680 entities present in 3K memes.
1 PAPER • NO BENCHMARKS YET
A large scale, C2C marketplace e-commerce dataset.