Language-specific Effects on Automatic Speech Recognition Errors for World Englishes

Despite recent advancements in automated speech recognition (ASR) technologies, reports of unequal performance across speakers of different demographic groups abound. At the same time, the focus on performance metrics such as the Word Error Rate (WER) in prior studies limit the specificity and scope of recommendations that can be offered for system engineering to overcome these challenges. The current study bridges this gap by investigating the performance of Otter’s automatic captioning system on native and non-native English speakers of different language background through a linguistic analysis of segment-level errors. By examining language-specific error profiles for vowels and consonants motivated by linguistic theory, we find that certain categories of errors can be predicted from the phonological structure of a speaker’s native language.

PDF Abstract

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here