Explainability of Speech Recognition Transformers via Gradient-based Attention Visualization

In vision Transformers, attention visualization methods generate heatmaps that highlight the image regions corresponding to a predicted class, offering an explanation of how the model makes its predictions. These methods do not transfer directly to automatic speech recognition (ASR) Transformers: an ASR Transformer emits a prediction for every output token to form a sentence, whereas a vision Transformer produces a single overall classification for the input. Traditional attention visualization methods may therefore fail on ASR Transformers. In this work, we propose a novel attention visualization method for ASR Transformers that explains which audio frames give rise to the output text. Guided by this explainability, we also explore ways of improving the effectiveness of the ASR model. Compared with other Transformer attention visualization methods, ours is more efficient and more intuitively understandable, as it unravels the attention computation by following the information flow through the Transformer attention modules. In addition, we demonstrate the use of the visualization results in three ways: (1) we visualize attention with respect to the connectionist temporal classification (CTC) loss and use it to train the ASR model with an adversarial attention erasing regularization, which effectively decreases the word error rate (WER) and improves the model's generalization capability; (2) we visualize the attention on specific words, interpreting the model by exposing the semantic and grammatical relationships between those words; and (3) we analyze how the model manages to distinguish homophones, using contrastive explanations with respect to homophone pairs.
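To make the idea concrete, the sketch below shows one way a gradient-weighted attention relevance with respect to the CTC loss could be computed in PyTorch. The model interface is assumed for illustration: `model(features, input_lens)` returning log-probabilities of shape (T, batch, vocab), an `encoder.layers[i].self_attn.attn_weights` attribute holding each layer's attention map with `retain_grad()` applied, and no temporal subsampling in the encoder. These names are hypothetical and not taken from the paper's released code.

```python
import torch

def ctc_attention_relevance(model, features, input_lens, targets, target_lens):
    """Per-frame relevance from gradient-weighted attention w.r.t. the CTC loss.

    Assumptions (illustrative, not from the paper's code):
      * model(features, input_lens) -> log-probs of shape (T, B, V)
      * each encoder layer stores its attention map (B, H, T, T) in
        layer.self_attn.attn_weights and has called retain_grad() on it
      * the encoder does not subsample time, so input_lens also matches T
    """
    model.zero_grad()
    log_probs = model(features, input_lens)                 # (T, B, V)
    ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)
    loss = ctc(log_probs, targets, input_lens, target_lens)
    loss.backward()                                         # populates attn grads

    relevance = None
    for layer in model.encoder.layers:
        attn = layer.self_attn.attn_weights                 # (B, H, T, T)
        grad = attn.grad                                     # same shape
        # Keep only positively contributing attention, averaged over heads.
        layer_rel = (grad * attn).clamp(min=0).mean(dim=1)  # (B, T, T)
        relevance = layer_rel if relevance is None else relevance + layer_rel

    # Sum over query positions to obtain an importance score per audio frame.
    return relevance.sum(dim=1)                             # (B, T)
```

The resulting per-frame scores can be plotted against the spectrogram to inspect which frames drive the transcription, or thresholded to select frames for an attention-erasing regularizer of the kind the abstract describes.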
