Improving the previous state-of-the-art Frisian ASR by fine-tuning XLS-R

Automatic Speech Recognition (ASR), which converts human speech to text, plays a major role in digitizing human communication. Despite its significance, most ASR systems are designed for higher-resourced languages such as English, Mandarin, or Spanish, leaving lower-resourced languages such as Frisian underrepresented. To address this gap, our paper introduces an ASR model for Frisian speech, fine-tuned from the Wav2Vec 2.0 XLS-R architecture on version 12.0 of the Common Voice corpus. With a learning rate of 8e-5, the proposed system achieves a word error rate (WER) of 15.99%, surpassing the previous state-of-the-art of 16.25% and serving as a benchmark for future research in this field.
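
Because the result is reported as a WER, the minimal sketch below shows how that metric is commonly computed. WER is (S + D + I) / N: word substitutions, deletions, and insertions divided by the number of reference words. The sketch assumes plain whitespace tokenization with no text normalization, because the normalization used for the reported numbers is not described here. It also assumes corpus-level aggregation (total edits over total reference words). The function names are hypothetical and do not come from the paper. The last helper reproduces the arithmetic behind the comparison: 16.25% to 15.99% is 0.26 absolute points, or about a 1.6% relative reduction.

```python
from typing import Iterable, List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum number of substitutions, deletions, and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete ref[i-1]
                          curr[j - 1] + 1,     # insert hyp[j-1]
                          prev[j - 1] + cost)  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word edits divided by total reference words.

    Assumption: whitespace tokenization, no casing or punctuation normalization.
    """
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        ref, hyp = reference.split(), hypothesis.split()
        errors += word_edit_distance(ref, hyp)
        words += len(ref)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional improvement of `new` over `baseline` (both error rates)."""
    return (baseline - new) / baseline


if __name__ == "__main__":
    # Illustrative placeholder sentences, not data from the paper.
    print(corpus_wer([("the cat sat on the mat", "the cat sat on mat")]))
    print(relative_reduction(16.25, 15.99))
```

Pooling edits over the whole test set, rather than averaging per-utterance WER, keeps short utterances from dominating the score. This is why the sketch accumulates counts before dividing.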

Task | Dataset | Model | Metric | Value | Global rank
Speech Recognition | Common Voice Frisian | wav2vec2-large-xls-r-1b-frisian | Test WER | 15.99% | #1

Methods