Shallow over Deep Neural Networks: An empirical analysis for human emotion classification using audio data

Human emotions can be identified in numerous ways, ranging from the tonal properties of speech to the facial expressions formed before speech delivery, and even body gestures that can suggest various emotions without a word being spoken. Knowing an individual's true emotions helps us understand the situation and react to it. This is equally true for many feedback systems used in day-to-day communication with humans, particularly those used in smart home solutions. Automated emotion recognition draws on several fields of research, from computer vision and physiology to artificial intelligence. This work focuses on classifying emotions into eight categories (neutral, happy, sad, angry, calm, fearful, disgust, and surprised) based on how sentences are spoken, using the “Ryerson Audio-Visual Database of Emotional Speech and Song” (RAVDESS). We propose a novel approach for emotion classification of audio conversations based on speech signals. Emotion classification based on acoustic properties is independent of the spoken language and can therefore be used for cross-language emotion classification. The aim of this contribution is to develop a system capable of automatically recognising emotions from real-time speech. We performed several simulations and achieved the highest accuracy of 82.99% with our shallow CNN model.
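The abstract does not spell out the acoustic features or the exact CNN-X architecture, so the sketch below is only a hypothetical illustration of the general approach: extract a fixed-size spectral representation (here, MFCCs via librosa) from each RAVDESS clip and feed it to a shallow convolutional classifier over the eight emotion classes. The feature choice, layer sizes, and hyperparameters are assumptions for illustration, not the authors' configuration.

```python
# Hypothetical sketch of a shallow CNN for 8-class speech emotion recognition.
# MFCC features and the single-conv-block architecture are assumptions; the
# paper's abstract does not specify the authors' actual features or layers.
import numpy as np
import librosa
import tensorflow as tf

EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

def extract_mfcc(path, sr=22050, n_mfcc=40, max_frames=174):
    """Load one audio clip and return a fixed-size MFCC matrix with a channel axis."""
    audio, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    # Pad or truncate along the time axis so every clip has the same shape.
    if mfcc.shape[1] < max_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
    else:
        mfcc = mfcc[:, :max_frames]
    return mfcc[..., np.newaxis]

def build_shallow_cnn(input_shape=(40, 174, 1), num_classes=len(EMOTIONS)):
    """A shallow CNN: one convolutional block followed by a small dense head."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                               input_shape=input_shape),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_shallow_cnn()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training would use MFCC arrays from extract_mfcc() and integer emotion labels:
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50)
```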


Datasets

RAVDESS
Results from the Paper


Task:    Speech Emotion Recognition
Dataset: RAVDESS
Model:   CNN-X (Shallow CNN)

Metric       Value     Global Rank
Accuracy     82.99%    #2
F1 Score     0.82      #1
Precision    0.82      #1
Recall       0.82      #1
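As a point of reference, a minimal sketch of how metrics of this kind (accuracy plus macro-averaged F1, precision, and recall) are typically computed from held-out predictions with scikit-learn; the toy labels below are purely illustrative and are not taken from the paper.

```python
# Illustrative only: computing accuracy, macro F1, precision, and recall
# from ground-truth and predicted emotion labels (toy values shown).
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [0, 1, 2, 3, 0, 1, 2, 3]   # ground-truth emotion class indices
y_pred = [0, 1, 2, 3, 0, 1, 2, 2]   # model predictions for the same clips

print("Accuracy :", accuracy_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred, average="macro"))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
```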

Methods


No methods listed for this paper.