Europarl-ASR

Introduced by Díaz-Munío et al. in Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization

Europarl-ASR (EN) is a 1300-hour English-language speech and text corpus of parliamentary debates, built from European Parliament speeches and their official transcripts (1996-2020). It is intended for (streaming) Automatic Speech Recognition training and benchmarking, speech data filtering, and speech data verbatimization. It includes dev-test sets for streaming ASR benchmarking, made up of 18 hours of manually revised speeches. Because both manual non-verbatim and verbatim transcripts are available for the dev-test speeches, the corpus is also useful for assessing automatic filtering and verbatimization techniques. The corpus is released under an open licence at https://www.mllp.upv.es/europarl-asr/

Europarl-ASR CONTENTS: [Speech data] 1300 hours of English-language annotated speech data; 3 full sets of timed transcriptions (official non-verbatim, automatically noise-filtered, automatically verbatimized); 18 hours of speech data with both manually revised verbatim transcriptions and official non-verbatim transcriptions, split into 2 independent validation-evaluation partitions for 2 realistic ASR tasks (with vs. without previous knowledge of the speaker). [Text data] 70 million tokens of English-language text data. [Pretrained language models] The Europarl-ASR English-language n-gram language model and vocabulary.