Automatic Spoken Language Identification Utilizing Acoustic and Phonetic Speech Information

1 Jun 2004 · Kim-Yung Eddie Wong, BEng(Hons), BIT

Automatic spoken Language Identification (LID) is the process of identifying the language spoken within an utterance. The challenge of this task is that no prior information is available about the content of the utterance or the identity of the speaker. Globalization and the pervasive popularity of the Internet will increase the need for the capabilities that spoken language identification systems provide. A prominent application arises in call centers that deal with callers who speak different languages. Another important application is indexing or searching large speech archives and corpora that contain multiple languages. The aim of this research is to develop techniques for a fast automatic spoken LID system that is more accurate than the systems in the previous National Institute of Standards and Technology (NIST) Language Recognition Evaluation. Acoustic and phonetic speech information are selected as the most suitable features for representing the characteristics of a language. The acoustic speech features are modeled with a Gaussian Mixture Model based approach. Phonetic speech information is extracted using existing speech recognition technology. Various techniques to improve LID accuracy are also studied. One approach examined is the use of Vocal Tract Length Normalization to reduce the speech variation caused by different speakers. A linear data fusion technique is adopted to combine the different kinds of information extracted from speech. As a result of this research, an LID system was implemented and submitted for evaluation in the 2003 Language Recognition Evaluation conducted by NIST.
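
The linear data fusion step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each subsystem (for example an acoustic GMM scorer and a phonetic scorer) outputs one score per candidate language, and that fusion is a weighted sum of those scores. The function names, language labels, score values, and weights below are hypothetical and chosen only for illustration; the abstract does not state how the thesis computes or trains its fusion weights.

```python
from typing import Dict, List


def fuse_scores(system_scores: List[Dict[str, float]],
                weights: List[float]) -> Dict[str, float]:
    """Linearly combine per-language scores from several subsystems.

    Assumption: every subsystem scores the same set of languages and
    higher scores mean a better match.
    """
    if len(system_scores) != len(weights):
        raise ValueError("need exactly one weight per subsystem")
    languages = list(system_scores[0].keys())
    fused = {}
    for lang in languages:
        # Weighted sum of this language's score across all subsystems.
        fused[lang] = sum(w * scores[lang]
                          for w, scores in zip(weights, system_scores))
    return fused


def identify_language(fused: Dict[str, float]) -> str:
    """Pick the language with the highest fused score."""
    return max(fused, key=fused.get)


# Hypothetical log-likelihood-style scores for one utterance.
acoustic = {"english": -1.2, "mandarin": -0.8, "spanish": -2.0}
phonetic = {"english": -0.5, "mandarin": -1.5, "spanish": -1.9}

# Hypothetical fusion weights (acoustic, phonetic).
fused = fuse_scores([acoustic, phonetic], [0.4, 0.6])
best = identify_language(fused)
print(best)
```

In this toy example the acoustic scores alone would favor Mandarin, while the fused scores favor English, which shows how a second information source can change the decision. In practice the weights would be tuned on held-out development data.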
