Scribosermo: Fast Speech-to-Text models for German and other Languages
Recent Speech-to-Text models often require a large amount of hardware resources and are mostly trained in English. This paper presents Speech-to-Text models for German, as well as for Spanish and French, with special features: (a) They are small and run in real-time on microcontrollers like a Raspberry Pi. (b) Using a pretrained English model, they can be trained on consumer-grade hardware with a relatively small dataset. (c) The models are competitive with other solutions and outperform them in German. In this respect, the models combine advantages of other approaches, which only include a subset of the presented features. Furthermore, the paper provides a new library for handling datasets, which is focused on easy extension with additional datasets, and shows an optimized way of transfer-learning new languages using a pretrained model from another language with a similar alphabet.
Results from the Paper
Ranked #1 on Speech Recognition on Common Voice Italian (using extra training data)
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Speech Recognition | Common Voice French | QuartzNet15x5FR (CV-only) | Test WER | 12.1% | #7
Speech Recognition | Common Voice French | ConformerCTC-L (5-gram) | Test WER | 8.13% | #1
Speech Recognition | Common Voice French | ConformerCTC-L (no-LM) | Test WER | 10.19% | #5
Speech Recognition | Common Voice French | QuartzNet15x5FR (D7) | Test WER | 11.0% | #6
Speech Recognition | Common Voice German | QuartzNet15x5DE (CV-only, 5-gram) | Test WER | 7.7% | #10
Speech Recognition | Common Voice German | QuartzNet15x5DE (CV-only, 5-gram) | Test CER | 3.2% | #6
Speech Recognition | Common Voice German | ConformerCTC-L (5-gram) | Test WER | 4.05% | #3
Speech Recognition | Common Voice German | ConformerCTC-L (5-gram) | Test CER | 1.37% | #1
Speech Recognition | Common Voice German | ConformerCTC-L (no LM) | Test WER | 7.33% | #9
Speech Recognition | Common Voice German | ConformerCTC-L (no LM) | Test CER | 2.05% | #4
Speech Recognition | Common Voice German | QuartzNet15x5DE (D37, 5-gram) | Test WER | 6.6% | #7
Speech Recognition | Common Voice German | QuartzNet15x5DE (D37, 5-gram) | Test CER | 2.7% | #5
Speech Recognition | Common Voice Italian | QuartzNet15x5IT (D5) | Test WER | 11.5% | #1
Speech Recognition | Common Voice Spanish | QuartzNet15x5ES (D8) | Test WER | 10.0% | #5
Speech Recognition | Common Voice Spanish | QuartzNet15x5ES (CV-only) | Test WER | 10.5% | #7
Speech Recognition | Common Voice Spanish | ConformerCTC-L (no-LM) | Test WER | 7.46% | #4
Speech Recognition | Common Voice Spanish | ConformerCTC-L (5-gram) | Test WER | 5.68% | #2
Speech Recognition | TUDA | QuartzNet15x5DE (D37) | Test WER | 10.2% | #3
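
The results are reported as word error rate (WER) and character error rate (CER). As a rough illustration of what these metrics measure, the sketch below computes both as the Levenshtein edit distance between reference and hypothesis, divided by the reference length. The function names `edit_distance`, `wer`, and `cer` are hypothetical, and the whitespace tokenization is an assumption: the text normalization used for the reported numbers is not stated here, so this is not necessarily how the listed scores were produced.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j tokens of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes plain whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; spaces count as characters here (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    reference = "das ist ein kleiner test"
    hypothesis = "das ist ein kleine test"
    print(f"WER: {wer(reference, hypothesis):.3f}")
    print(f"CER: {cer(reference, hypothesis):.3f}")
```

In this example one of five reference words is wrong, so the WER is 0.2, while only one of 24 reference characters is missing, so the CER is much lower. The same pattern appears in the table, where the CER values are consistently smaller than the WER values for the same model.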