Large Vocabulary Read Speech Corpora for Four Ethiopian Languages: Amharic, Tigrigna, Oromo, and Wolaytta

Automatic Speech Recognition (ASR) is one of the most important technologies to help people live a better life in the 21st century. However, its development requires a big speech corpus for a language. The development of such a corpus is expensive especially for under-resourced Ethiopian languages. To address this problem we have developed four medium-sized (longer than 22 hours each) speech corpora for four Ethiopian languages: Amharic, Tigrigna, Oromo, and Wolaytta. In a way of checking the usability of the corpora and deliver a baseline ASR for each language. In this paper, we present the corpora and the baseline ASR systems for each language. The word error rates (WERs) we achieved show that the corpora are usable for further investigation and we recommend the collection of text corpora to train strong language models for Oromo and Wolaytta compared to others.

PDF Abstract

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here