Is Attention always needed? A Case Study on Language Identification from Speech

5 Oct 2021  ·  Atanu Mandal, Santanu Pal, Indranil Dutta, Mahidas Bhattacharya, Sudip Kumar Naskar

Language Identification (LID), a recommended initial step for Automatic Speech Recognition (ASR), detects the spoken language from audio specimens. State-of-the-art systems capable of multilingual speech processing, however, require users to explicitly set one or more languages before use. LID therefore plays a very important role in multilingual contexts, where ASR-based systems that cannot parse the uttered language fail at speech recognition. We propose an attention-based convolutional recurrent neural network (CRNN with Attention) that works on Mel-frequency Cepstral Coefficient (MFCC) features of audio specimens. Additionally, we reproduce some state-of-the-art approaches, namely Convolutional Neural Network (CNN) and Convolutional Recurrent Neural Network (CRNN), and compare them to our proposed method. We performed extensive evaluation on thirteen different Indian languages, and our model achieves classification accuracy over 98%. Our LID model is robust to noise and provides 91.2% accuracy in a noisy scenario. The proposed model is easily extensible to new languages.
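The abstract names an attention layer on top of a convolutional recurrent network but does not say how the attention is formulated. As a rough illustration only, the sketch below shows one common choice, dot-product attention pooling, which collapses a sequence of frame-level vectors into a single utterance-level vector. The function names `softmax` and `attention_pool`, the `query` vector, and the toy frame values are all hypothetical; in a real model the frames would be recurrent-layer outputs computed from MFCC features and the query would be learned.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    frames: Sequence[Sequence[float]], query: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Collapse frame vectors into one vector using dot-product attention.

    Assumption: scoring is a plain dot product with a single query vector.
    The paper's actual attention mechanism may differ.
    """
    # Score each frame by its dot product with the query vector.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    # Normalise the scores into weights that sum to 1.
    weights = softmax(scores)
    # Weighted sum of the frames, computed one dimension at a time.
    dim = len(query)
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


# Toy stand-in for three time steps of 2-dimensional recurrent output.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]  # hypothetical; would be a learned parameter
pooled, weights = attention_pool(frames, query)
print(pooled, weights)
```

The pooled vector would then feed a classification layer with one output per language. Frames that align more strongly with the query receive larger weights, so they contribute more to the utterance-level representation.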


Datasets


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Spoken language identification | IndicTTS | CNN | Classification Accuracy | 0.983 | # 3 |
| Spoken language identification | IndicTTS | CRNN Attention | Classification Accuracy | 0.987 | # 1 |
| Spoken language identification | IndicTTS | CRNN | Classification Accuracy | 0.987 | # 1 |
| Spoken language identification | YouTube News dataset (No Noise) | CRNN | Accuracy | 0.967 | # 1 |
| Spoken language identification | YouTube News dataset (No Noise) | CRNN Attention | Accuracy | 0.966 | # 2 |
| Spoken language identification | YouTube News dataset (No Noise) | CNN | Accuracy | 0.948 | # 4 |
| Spoken language identification | YouTube News dataset (White Noise) | CRNN Attention | Accuracy | 0.888 | # 3 |
| Spoken language identification | YouTube News dataset (White Noise) | CNN | Accuracy | 0.871 | # 4 |
| Spoken language identification | YouTube News dataset (White Noise) | CRNN | Accuracy | 0.912 | # 1 |
