10 dataset results for Multi-Task Learning AND Texts AND English

Clotho

Clotho is an audio captioning dataset consisting of 4,981 audio samples, each with five captions (24,905 captions in total). Audio samples are 15 to 30 s long and captions are 8 to 20 words long.

139 PAPERS • 5 BENCHMARKS

KP20k

KP20k is a large-scale scholarly articles dataset with 528K articles for training, 20K articles for validation and 20K articles for testing.

79 PAPERS • 3 BENCHMARKS

CELEX

The CELEX database comprises three searchable lexical databases: Dutch, English and German. The lexical data in each database is divided into five categories: orthography, phonology, morphology, syntax (word class) and word frequency.

57 PAPERS • NO BENCHMARKS YET

VMSMO

The Video-based Multimodal Summarization with Multimodal Output (VMSMO) corpus consists of 184,920 document-summary pairs: 180,000 training pairs, 2,460 validation pairs and 2,460 test pairs. The task for this dataset is to generate an appropriate textual summary of an article and to choose a proper cover frame from a video accompanying the article.
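The VMSMO split sizes can be checked with simple arithmetic. The sketch below assumes the validation and test sets each hold 2,460 pairs, which is the reading under which the three splits sum exactly to the stated total of 184,920; the variable names are illustrative only.

```python
# Split sizes taken from the description. The equal validation/test sizes
# are an assumption: it is the reading that makes the total come out exactly.
splits = {"train": 180_000, "validation": 2_460, "test": 2_460}

# Total number of document-summary pairs across all splits.
total = sum(splits.values())

# Share of the corpus held by each split.
fractions = {name: size / total for name, size in splits.items()}

for name, size in splits.items():
    print(f"{name:<10} {size:>7,} ({fractions[name]:.1%})")
print(f"{'total':<10} {total:>7,}")
```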

8 PAPERS • NO BENCHMARKS YET

SkillSpan

SkillSpan (Hard and Soft Skill Extraction from English Job Postings)

SkillSpan is a dataset for Skill Extraction (SE). SE is an important and widely studied task that is useful for gaining insights into labor market dynamics. However, datasets and annotation guidelines are scarce: the few available datasets contain crowd-sourced labels at the span level or labels from a predefined skill inventory. To address this gap, the authors introduce SkillSpan, a novel SE dataset consisting of 14.5K sentences and over 12.5K annotated spans.

7 PAPERS • NO BENCHMARKS YET

Famulus

Famulus is a dataset for the segmentation and classification of epistemic activities in diagnostic reasoning texts.

3 PAPERS • NO BENCHMARKS YET

Tasksource

Hugging Face Datasets is a great library, but it lacks standardization, and datasets require preprocessing before they can be used interchangeably. tasksource automates this preprocessing and facilitates reproducible scaling of multi-task learning.

3 PAPERS • NO BENCHMARKS YET

CQR (Contextual Query Rewrite)

CQR is an extension to the Stanford Dialogue Corpus. It contains crowd-sourced rewrites to facilitate research in dialogue state tracking using natural language as the interface.

2 PAPERS • NO BENCHMARKS YET

ExHVV

ExHVV is a novel dataset that offers natural language explanations of connotative roles for three types of entities (heroes, villains, and victims), covering 4,680 entities present in 3K memes.

1 PAPER • NO BENCHMARKS YET

MerRec (MerRec Recommendation Dataset)

A large-scale C2C marketplace e-commerce dataset.

1 PAPER • NO BENCHMARKS YET