Scribosermo: Fast Speech-to-Text models for German and other Languages

15 Oct 2021 · Daniel Bermuth, Alexander Poeppel, Wolfgang Reif

Recent Speech-to-Text models often require a large amount of hardware resources and are mostly trained in English. This paper presents Speech-to-Text models for German, as well as for Spanish and French with special features: (a) They are small and run in real-time on microcontrollers like a RaspberryPi. (b) Using a pretrained English model, they can be trained on consumer-grade hardware with a relatively small dataset. (c) The models are competitive with other solutions and outperform them in German. In this respect, the models combine advantages of other approaches, which only include a subset of the presented features. Furthermore, the paper provides a new library for handling datasets, which is focused on easy extension with additional datasets and shows an optimized way for transfer-learning new languages using a pretrained model from another language with a similar alphabet.

Results from the Paper


Ranked #1 on Speech Recognition on Common Voice Italian (using extra training data)

Task                Dataset               Model                              Metric Name  Metric Value  Global Rank
Speech Recognition  Common Voice French   QuartzNet15x5FR (CV-only)          Test WER     12.1%         #7
Speech Recognition  Common Voice French   ConformerCTC-L (5-gram)            Test WER     8.13%         #1
Speech Recognition  Common Voice French   ConformerCTC-L (no-LM)             Test WER     10.19%        #5
Speech Recognition  Common Voice French   QuartzNet15x5FR (D7)               Test WER     11.0%         #6
Speech Recognition  Common Voice German   QuartzNet15x5DE (CV-only, 5-gram)  Test WER     7.7%          #10
                                                                             Test CER     3.2%          #6
Speech Recognition  Common Voice German   ConformerCTC-L (5-gram)            Test WER     4.05%         #3
                                                                             Test CER     1.37%         #1
Speech Recognition  Common Voice German   ConformerCTC-L (no LM)             Test WER     7.33%         #9
                                                                             Test CER     2.05%         #4
Speech Recognition  Common Voice German   QuartzNet15x5DE (D37, 5-gram)      Test WER     6.6%          #7
                                                                             Test CER     2.7%          #5
Speech Recognition  Common Voice Italian  QuartzNet15x5IT (D5)               Test WER     11.5%         #1
Speech Recognition  Common Voice Spanish  QuartzNet15x5ES (D8)               Test WER     10.0%         #5
Speech Recognition  Common Voice Spanish  QuartzNet15x5ES (CV-only)          Test WER     10.5%         #7
Speech Recognition  Common Voice Spanish  ConformerCTC-L (no-LM)             Test WER     7.46%         #4
Speech Recognition  Common Voice Spanish  ConformerCTC-L (5-gram)            Test WER     5.68%         #2
Speech Recognition  TUDA                  QuartzNet15x5DE (D37)              Test WER     10.2%         #3
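
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how word error rate (WER) and character error rate (CER) are commonly defined: the edit distance between reference and hypothesis, divided by the reference length. It is not taken from the paper; the function names and the example sentences are hypothetical, and the paper's exact text normalization and corpus-level aggregation are not stated here and may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one utterance (assumes a non-empty reference)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one utterance (assumes a non-empty reference)."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example pair, not from any dataset in the table.
reference = "das ist ein kleiner test"
hypothesis = "das ist ein kleine test"
print(f"WER: {wer(reference, hypothesis):.1%}")
print(f"CER: {cer(reference, hypothesis):.1%}")
```

In this sketch one wrong word out of five gives a much higher WER than CER, which is consistent with the CER values in the table being lower than the WER values for the same model.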
