Automatic Speech Recognition in German: A Detailed Error Analysis
The number of freely available neural-network-based systems for automatic speech recognition (ASR) is growing steadily, and their predictions are becoming increasingly reliable. However, trained models are typically evaluated solely with statistical metrics such as word error rate (WER) or character error rate (CER), which provide no insight into the nature or impact of the errors produced when predicting transcripts from speech input. This work presents a selection of ASR model architectures pretrained on German and evaluates them on a benchmark of diverse test datasets. It identifies prediction errors shared across architectures, classifies them into categories, and traces the sources of errors in each category back to the training data and to other sources. Finally, it discusses ways to create higher-quality training datasets and more robust ASR systems.
Results from the Paper
Ranked #1 on Automatic Speech Recognition (ASR) on VoxPopuli (using extra training data)
Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Uses Extra Training Data | Benchmark |
---|---|---|---|---|---|---|---|
Speech Recognition | Common Voice German | Conformer Transducer (no LM) | Test WER | 6.28% | #6 | | |
Automatic Speech Recognition (ASR) | HUI speech corpus | Conformer Transducer | WER (%) | 1.89% | #1 | | |
Automatic Speech Recognition (ASR) | M-AILabs speech dataset | Conformer Transducer | WER (%) | 4.28% | #1 | | |
Automatic Speech Recognition (ASR) | The Spoken Wikipedia Corpora | Conformer Transducer | WER (%) | 8.04% | #1 | | |
Speech Recognition | TUDA | Conformer-Transducer (no LM) | Test WER | 5.82% | #1 | | |
Automatic Speech Recognition (ASR) | Voxforge German | Conformer Transducer | WER (%) | 3.36% | #1 | | |
Automatic Speech Recognition (ASR) | VoxPopuli | Conformer Transducer (German) | WER (%) | 8.98% | #1 | | |
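
For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how WER and CER are commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, counted in words or in characters. It is not the evaluation code used in the paper. The function names and the German example sentences are invented for illustration, and text normalization (casing, punctuation), which real evaluations usually apply first, is left out.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum number of substitutions,
    deletions and insertions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example sentences, not taken from any of the datasets above.
reference = "das ist ein kleiner test"
hyp_a = "das ist ein kleiner text"   # one word misrecognized
hyp_b = "das ist kein kleiner test"  # "ein" -> "kein" negates the sentence

for hyp in (hyp_a, hyp_b):
    print(f"{hyp!r}: WER={wer(reference, hyp):.3f} CER={cer(reference, hyp):.3f}")
```

Both hypotheses differ from the reference by a single word and a single character, so they receive identical WER and CER scores even though the second one inverts the meaning of the sentence. This is the limitation the abstract points to: the scores count errors but say nothing about their nature or impact.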