Multi-stream Attention-based BLSTM with Feature Segmentation for Speech Emotion Recognition

This paper proposes a speech emotion recognition technique that considers the suprasegmental characteristics and temporal changes of individual speech parameters. In recent years, speech emotion recognition using Bidirectional LSTM (BLSTM) has been actively studied because the model can focus on particular temporal regions that contain strong emotional characteristics. One weakness of this model is that it cannot consider the statistics of speech features, which are known to be effective for speech emotion recognition. Moreover, it cannot learn individual attention parameters for different descriptors because it processes the input sequence with a single BLSTM. In this paper, we introduce feature segmentation and multi-stream processing into the attention-based BLSTM to solve these problems. In addition, we employ data augmentation based on emotional speech synthesis in the training step. Classification experiments on four emotions (i.e., anger, joy, neutral, and sadness) using the Japanese Twitter-based Emotional Speech corpus (JTES) showed that the proposed method achieved a recognition accuracy of 73.4%, which is comparable to human evaluation (75.5%).
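To make the architecture concrete, below is a minimal sketch (not the authors' code) of a multi-stream attention-based BLSTM with feature segmentation: the frame-level feature vector is split into per-descriptor groups, each group is processed by its own BLSTM with its own attention pooling, and the attended summaries are concatenated for four-class classification. The descriptor grouping, layer sizes, and pooling details here are assumptions for illustration only.

```python
# Hypothetical sketch of multi-stream attention-based BLSTM with
# feature segmentation. Stream boundaries and dimensions are assumed,
# not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveBLSTMStream(nn.Module):
    """One stream: a BLSTM followed by additive attention pooling over time."""

    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)  # attention energy per frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.blstm(x)                     # (B, T, 2*hidden)
        alpha = F.softmax(self.score(h), dim=1)  # (B, T, 1) frame weights
        return (alpha * h).sum(dim=1)            # attention-weighted summary


class MultiStreamEmotionNet(nn.Module):
    """Segment the feature vector into streams, pool each, then classify."""

    def __init__(self, stream_dims: list[int], n_classes: int = 4):
        super().__init__()
        self.stream_dims = stream_dims
        self.streams = nn.ModuleList(AttentiveBLSTMStream(d) for d in stream_dims)
        self.classifier = nn.Linear(128 * len(stream_dims), n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Feature segmentation: split the last dimension by descriptor group,
        # so each stream learns its own attention parameters.
        parts = torch.split(x, self.stream_dims, dim=-1)
        pooled = [stream(p) for stream, p in zip(self.streams, parts)]
        return self.classifier(torch.cat(pooled, dim=-1))


# Example with assumed streams: 20-dim spectral, 1-dim F0, 1-dim power.
model = MultiStreamEmotionNet([20, 1, 1])
logits = model(torch.randn(8, 300, 22))  # batch of 8 utterances, 300 frames
```

In this sketch, the per-stream attention pooling gives each descriptor group its own learned temporal weighting, which is the property the abstract says a single-BLSTM model lacks.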
