CNN-n-GRU: end-to-end speech emotion recognition from raw waveform signal using CNNs and gated recurrent unit networks

We present CNN-n-GRU, a new end-to-end (E2E) architecture for speech emotion recognition, composed of an n-layer convolutional neural network (CNN) followed sequentially by an n-layer gated recurrent unit (GRU) network. Both CNNs and RNNs have shown promising results when fed raw waveform speech inputs, which motivated us to combine them in a single model to maximise their potential. Instead of using hand-crafted features or spectrograms, we train the CNN to learn low-level speech representations directly from the raw waveform, allowing the network to capture relevant narrow-band emotion characteristics. The RNN layers (GRUs in our case) learn temporal characteristics, allowing the network to better capture the signal's time-distributed features. Because a CNN builds up multiple levels of representation abstraction, we exploit its early layers to extract high-level features and supply a suitable input to the subsequent RNN layers, which aggregate long-term dependencies. By taking advantage of both CNNs and GRUs in a single model, the proposed architecture has important advantages over other models in the literature. We evaluated the proposed model on the TESS dataset and compared it to state-of-the-art methods. Our experimental results demonstrate that the proposed model is more accurate than traditional classification approaches for speech emotion recognition.

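To make the pipeline concrete, here is a minimal PyTorch sketch of the CNN-n-GRU idea: an n-layer 1-D CNN over the raw waveform feeding an n-layer GRU, whose final hidden state is classified into emotion categories. All hyperparameters (n, channel widths, kernel sizes, strides, hidden size) are illustrative assumptions, not the paper's reported configuration; `num_classes=7` reflects the seven emotion categories in TESS.

```python
import torch
import torch.nn as nn

class CNNnGRU(nn.Module):
    """Illustrative sketch of the CNN-n-GRU architecture.

    All hyperparameters are assumptions for demonstration, not the
    configuration reported in the paper.
    """

    def __init__(self, n=4, channels=64, hidden=128, num_classes=7):
        super().__init__()
        convs = []
        in_ch = 1  # raw waveform: a single input channel
        for _ in range(n):
            convs += [
                nn.Conv1d(in_ch, channels, kernel_size=8, stride=2, padding=3),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
                nn.MaxPool1d(2),
            ]
            in_ch = channels
        self.cnn = nn.Sequential(*convs)
        # n stacked GRU layers aggregate long-term temporal dependencies
        self.gru = nn.GRU(input_size=channels, hidden_size=hidden,
                          num_layers=n, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: (batch, 1, samples) raw waveform
        feats = self.cnn(x)            # (batch, channels, frames)
        feats = feats.transpose(1, 2)  # (batch, frames, channels)
        _, h = self.gru(feats)         # h: (n, batch, hidden)
        return self.fc(h[-1])          # logits over emotion classes

# Example: a batch of eight one-second clips at 16 kHz
model = CNNnGRU()
logits = model(torch.randn(8, 1, 16000))
print(logits.shape)  # torch.Size([8, 7])
```

The convolutional stack downsamples the waveform into a frame-level feature sequence, so the GRU operates over a few dozen frames rather than thousands of raw samples, which is what lets the recurrent layers model long-term dependencies efficiently.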