MuST-C currently represents the largest publicly available multilingual corpus (one-to-many) for speech translation. It covers eight language directions, from English to German, Spanish, French, Italian, Dutch, Portuguese, Romanian and Russian. The corpus consists of audio, transcriptions and translations of English TED talks, and it comes with a predefined training, validation and test split.
194 PAPERS • 2 BENCHMARKS
The IAM database contains 13,353 images of handwritten lines of text created by 657 writers. The texts these writers transcribed are from the Lancaster-Oslo/Bergen Corpus of British English. In total, the writers contributed 1,539 handwritten pages comprising 115,320 words, and the database is categorized as a modern collection. It is labeled at the sentence, line, and word levels.
173 PAPERS • 2 BENCHMARKS
Paraphrase Adversaries from Word Scrambling (PAWS) is a dataset containing 108,463 human-labeled and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order information for the problem of paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other based on the Quora Question Pairs (QQP) dataset.
144 PAPERS • NO BENCHMARKS YET
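The pair structure PAWS is built around can be sketched as a simple record. The field names below are assumptions chosen for illustration, not PAWS's official schema, and the example is an illustrative word-scrambled adversary in the spirit of the dataset (same bag of words, different meaning):

```python
from dataclasses import dataclass

@dataclass
class PawsPair:
    """One paraphrase-identification example (hypothetical field names)."""
    sentence1: str
    sentence2: str
    label: int  # 1 = paraphrase, 0 = not a paraphrase

# Word scrambling preserves lexical overlap but flips the meaning,
# so bag-of-words models are fooled while the gold label is 0.
pair = PawsPair(
    sentence1="Flights from New York to Florida",
    sentence2="Flights from Florida to New York",
    label=0,
)
```

Checking that the two sentences share an identical multiset of words, as here, is exactly the property that makes such pairs adversarial for models that ignore word order.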
Clotho is an audio captioning dataset consisting of 4,981 audio samples, each with five captions (24,905 captions in total). Audio samples are 15 to 30 s long, and captions are 8 to 20 words long.
139 PAPERS • 5 BENCHMARKS
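The caption total quoted for Clotho follows directly from the per-sample count; a quick arithmetic check:

```python
# Clotho: 4,981 audio samples, five captions per sample.
num_samples = 4981
captions_per_sample = 5
total_captions = num_samples * captions_per_sample
print(total_captions)  # 24905
```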
MathQA significantly enhances the AQuA dataset with fully-specified operational programs.
94 PAPERS • 1 BENCHMARK
GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia and released by Google AI Language for the evaluation of coreference resolution in practical applications.
93 PAPERS • 1 BENCHMARK
SParC is a large-scale dataset for complex, cross-domain, and context-dependent (multi-turn) semantic parsing and text-to-SQL tasks (interactive natural language interfaces for relational databases).
48 PAPERS • 2 BENCHMARKS
The HELP dataset is an automatically created natural language inference (NLI) dataset that embodies the combination of lexical and logical inferences, focusing on monotonicity (i.e., phrase replacement-based reasoning). HELP (ver. 1.0) contains 36K inference pairs covering upward monotone, downward monotone, non-monotone, conjunction, and disjunction.
28 PAPERS • 1 BENCHMARK
Evidence Inference is a corpus for the evidence inference task, comprising 10,000+ prompts coupled with full-text articles describing randomized controlled trials (RCTs).
25 PAPERS • NO BENCHMARKS YET
MeQSum is a dataset for medical question summarization. It contains 1,000 summarized consumer health questions.
25 PAPERS • 1 BENCHMARK
MedQuAD includes 47,457 medical question-answer pairs created from 12 NIH websites (e.g. cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). The collection covers 37 question types (e.g. Treatment, Diagnosis, Side Effects) associated with diseases, drugs and other medical entities such as tests.
23 PAPERS • NO BENCHMARKS YET
COunter NArratives through Nichesourcing (CONAN) is a dataset consisting of 4,078 pairs across three languages. Additionally, three types of metadata are provided: expert demographics, hate speech sub-topic, and counter-narrative type. The dataset is augmented through translation (from Italian/French to English) and paraphrasing, bringing the total number of pairs to 14,988.
21 PAPERS • NO BENCHMARKS YET
KdConv is a Chinese multi-domain knowledge-driven conversation dataset, grounding the topics of multi-turn conversations in knowledge graphs. KdConv contains 4.5K conversations from three domains (film, music, and travel) and 86K utterances, with an average turn number of 19.0. These conversations contain in-depth discussions of related topics and natural transitions between multiple topics, and the corpus can also be used to explore transfer learning and domain adaptation.
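The average turn number reported for KdConv is consistent with the corpus totals; a quick check, taking the rounded figures of 4.5K conversations and 86K utterances at face value:

```python
# KdConv totals as stated above (both figures are rounded).
conversations = 4_500
utterances = 86_000
avg_turns = utterances / conversations
print(round(avg_turns, 1))  # 19.1, close to the reported 19.0
```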
The George Washington dataset contains 20 pages of letters written by George Washington and his associates in 1755, and it is therefore categorized as a historical collection. The images are annotated at the word level and contain approximately 5,000 words.
19 PAPERS • NO BENCHMARKS YET
MED is an evaluation dataset covering a wide range of monotonicity reasoning. It was constructed by collecting naturally occurring examples through crowdsourcing and well-designed ones from linguistics publications, and consists of 5,382 examples.
18 PAPERS • 1 BENCHMARK
ETHOS is a hate speech detection dataset. It is built from YouTube and Reddit comments validated through a crowdsourcing platform. It has two subsets, one for binary classification and the other for multi-label classification. The former contains 998 comments, while the latter contains fine-grained hate-speech annotations for 433 comments.
17 PAPERS • 2 BENCHMARKS
InfoTabS comprises human-written textual hypotheses based on premises that are tables extracted from Wikipedia info-boxes.
13 PAPERS • NO BENCHMARKS YET
word2word contains easy-to-use word translations for 3,564 language pairs.
5 PAPERS • NO BENCHMARKS YET
Bentham manuscripts refers to a large set of documents written by the renowned English philosopher and reformer Jeremy Bentham (1748-1832) and transcribed by volunteers of the Transcribe Bentham initiative. To date, more than 6,000 documents (over 25,000 pages) have been transcribed through this public web platform. Experiments typically use the BenthamR0 dataset, a subset of the Bentham manuscripts.
4 PAPERS • 1 BENCHMARK
The Konzil dataset was created by specialists at the University of Greifswald. It contains manuscripts written in modern German. The training set consists of 353 lines, the validation set of 29 lines, and the test set of 87 lines.
3 PAPERS • NO BENCHMARKS YET
Patzig contains handwritten texts written in modern German. The training set consists of 485 lines, the validation set of 38 lines, and the test set of 118 lines.
Ricordi contains handwritten texts written in Italian. The training set consists of 295 lines, the validation set of 19 lines, and the test set of 69 lines.
Schiller contains handwritten texts written in modern German. The training set consists of 244 lines, the validation set of 21 lines, and the test set of 63 lines.
Schwerin contains handwritten texts written in medieval German. The training set consists of 793 lines, the validation set of 68 lines, and the test set of 196 lines.
The Saint Gall dataset contains handwritten historical manuscripts written in Latin that date back to the 9th century. It consists of 60 pages, 1,410 text lines, and 11,597 words.
2 PAPERS • 1 BENCHMARK
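The Saint Gall figures above imply the following per-page and per-line averages (a derived sanity check, not numbers from the dataset's documentation):

```python
# Saint Gall totals: 60 pages, 1,410 text lines, 11,597 words.
pages, lines, words = 60, 1410, 11597
lines_per_page = lines / pages   # 23.5 lines per page
words_per_line = words / lines   # roughly 8.2 words per line
print(lines_per_page, round(words_per_line, 1))
```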
The CECW dataset is a color-extended version of Cleanup World (CW), borrowed from the mobile-manipulation robot domain. CW is a simulated world containing a movable object and four rooms colored blue, green, red, and yellow, in which an agent acts on the instructions it receives. Commands are parsed under a Geometric Linear Temporal Logic (GLTL) grammar, yielding a total of 3,382 commands reflecting 39 GLTL expressions.
1 PAPER • NO BENCHMARKS YET
The Gutenberg Poem Dataset is used for the next-verse prediction component.
RadioTalk is a corpus of speech recognition transcripts sampled from talk radio broadcasts in the United States between October 2018 and March 2019. The corpus is intended for use by researchers in natural language processing, conversational analysis, and the social sciences. It encompasses approximately 2.8 billion words of automatically transcribed speech from 284,000 hours of radio, together with metadata about the speech, such as geographical location, speaker turn boundaries, gender, and radio program information.