Common Voice is an audio dataset that consists of a unique MP3 and corresponding text file. There are 9,283 recorded hours in the dataset. The dataset also includes demographic metadata like age, sex, and accent. The dataset consists of 7,335 validated hours in 60 languages.
314 PAPERS • 266 BENCHMARKS
MuST-C currently represents the largest publicly available multilingual corpus (one-to-many) for speech translation. It covers eight language directions, from English to German, Spanish, French, Italian, Dutch, Portuguese, Romanian and Russian. The corpus consists of audio, transcriptions and translations of English TED talks, and it comes with a predefined training, validation and test split.
194 PAPERS • 2 BENCHMARKS
A corpus of parallel text in 21 European languages from the proceedings of the European Parliament.
125 PAPERS • NO BENCHMARKS YET
Europarl-ST is a multilingual Spoken Language Translation corpus containing paired audio-text samples for SLT from and into 9 European languages, for a total of 72 different translation directions. This corpus has been compiled using the debates held in the European Parliament in the period between 2008 and 2012.
55 PAPERS • NO BENCHMARKS YET
CoVoST is a large-scale multilingual speech-to-text translation corpus. Its latest 2nd version covers translations from 21 languages into English and from English into 15 languages. It has total 2880 hours of speech and is diversified with 78K speakers and 66 accents.
32 PAPERS • NO BENCHMARKS YET
Overall duration per microphone: about 36 hours (31 hrs train / 2.5 hrs dev / 2.5 hrs test) Count of microphones: 3 (Microsoft Kinect, Yamaha, Samson) Count of wave-files per microphone: about 14500 Overall count of participations: 180 (130 male / 50 female)
11 PAPERS • 1 BENCHMARK
The EMODB database is the freely available German emotional database. The database is created by the Institute of Communication Science, Technical University, Berlin, Germany. Ten professional speakers (five males and five females) participated in data recording. The database contains a total of 535 utterances. The EMODB database comprises of seven emotions: 1) anger; 2) boredom; 3) anxiety; 4) happiness; 5) sadness; 6) disgust; and 7) neutral. The data was recorded at a 48-kHz sampling rate and then down-sampled to 16-kHz.
6 PAPERS • 1 BENCHMARK
LibriVoxDeEn is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audiobooks. The speech translation data consist of 110 hours of audio material aligned to over 50k parallel sentences. An even larger dataset comprising 547 hours of German speech aligned to German text is available for speech recognition. The audio data is read speech and thus low in disfluencies.
3 PAPERS • NO BENCHMARKS YET
Tilde MODEL Corpus is a multilingual corpora for European languages – particularly focused on the smaller languages. The collected resources have been cleaned, aligned, and formatted into a corpora standard TMX format useable for developing new Language technology products and services.
2 PAPERS • NO BENCHMARKS YET
JamALT is a revision of the JamendoLyrics dataset (80 songs in 4 languages), adapted for use as an automatic lyrics transcription (ALT) benchmark.
1 PAPER • 5 BENCHMARKS