5 dataset results for Data Augmentation AND Texts AND German

MuST-C

MuST-C currently represents the largest publicly available multilingual corpus (one-to-many) for speech translation. It covers eight language directions, from English to German, Spanish, French, Italian, Dutch, Portuguese, Romanian and Russian. The corpus consists of audio, transcriptions and translations of English TED talks, and it comes with a predefined training, validation and test split.

194 PAPERS • 2 BENCHMARKS

Patzig

Patzig contains handwritten texts written in modern German. The training set consists of 485 lines, the validation set of 38 lines, and the test set of 118 lines.

3 PAPERS • NO BENCHMARKS YET

Schiller (Shiller)

Schiller contains handwritten texts written in modern German. The training set consists of 244 lines, the validation set of 21 lines, and the test set of 63 lines.

3 PAPERS • NO BENCHMARKS YET

Schwerin

Schwerin contains handwritten texts written in medieval German. The training set consists of 793 lines, the validation set of 68 lines, and the test set of 196 lines.

3 PAPERS • NO BENCHMARKS YET

Konzil (Konzilsprotokolle_C)

The Konzil dataset was created by specialists at the University of Greifswald. It contains manuscripts written in modern German. The training set consists of 353 lines, the validation set of 29 lines, and the test set of 87 lines.

3 PAPERS • NO BENCHMARKS YET
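
The split sizes quoted for the four handwriting datasets above can be compared with a few lines of arithmetic. The following is a minimal sketch, not part of any dataset's tooling: the `SPLITS` dictionary and the `split_summary` helper are hypothetical names introduced here, and the numbers are simply copied from the descriptions above.

```python
# Line counts per split, copied from the dataset descriptions above.
# SPLITS and split_summary are illustrative names, not an official API.
SPLITS = {
    "Patzig": {"train": 485, "validation": 38, "test": 118},
    "Schiller": {"train": 244, "validation": 21, "test": 63},
    "Schwerin": {"train": 793, "validation": 68, "test": 196},
    "Konzil": {"train": 353, "validation": 29, "test": 87},
}


def split_summary(splits):
    """Return the total line count and per-split fractions for each dataset."""
    summary = {}
    for name, counts in splits.items():
        total = sum(counts.values())
        summary[name] = {
            "total": total,
            "fractions": {split: n / total for split, n in counts.items()},
        }
    return summary


if __name__ == "__main__":
    for name, info in split_summary(SPLITS).items():
        parts = ", ".join(
            f"{split} {frac:.1%}" for split, frac in info["fractions"].items()
        )
        print(f"{name}: {info['total']} lines ({parts})")
```

Each of these datasets is small (a few hundred to roughly a thousand lines in total), which is presumably why they appear under a data augmentation search.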
