Hybrid Fusion Based Interpretable Multimodal Emotion Recognition with Limited Labelled Data

24 Aug 2022 · Puneet Kumar, Sarthak Malik, Balasubramanian Raman, Xiaobai Li

This paper proposes a multimodal emotion recognition system, VIsual Spoken Textual Additive Net (VISTA Net), to classify emotions reflected by multimodal input containing image, speech, and text into discrete classes. A new interpretability technique, K-Average Additive exPlanation (KAAP), has also been developed to identify the important visual, spoken, and textual features that lead to the prediction of a particular emotion class. VISTA Net fuses information from the image, speech, and text modalities using a hybrid of early and late fusion, automatically adjusting the weights of the intermediate outputs while computing their weighted average. The KAAP technique computes the contribution of each modality, and of its corresponding features, toward predicting a particular emotion class. To mitigate the scarcity of multimodal emotion datasets labeled with discrete emotion classes, we have constructed the large-scale IIT-R MMEmoRec dataset consisting of images, corresponding speech and text, and emotion labels ('angry,' 'happy,' 'hate,' and 'sad'). VISTA Net achieves 95.99% emotion recognition accuracy on the IIT-R MMEmoRec dataset when using the visual, spoken, and textual modalities together, outperforming configurations that use any one or two modalities.
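The abstract does not give implementation details, but the core idea of hybrid fusion with automatically adjusted weights over intermediate outputs can be illustrated with a short sketch. Everything below is an assumption for illustration, not the authors' released code: the encoder choices, feature dimensions, class count, and the softmax-normalized learnable weights are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridFusionNet(nn.Module):
    """Sketch of hybrid (early + late) fusion with learned modality weights.

    Encoders are simple linear layers standing in for the paper's actual
    image, speech, and text backbones; dimensions are assumed.
    """

    def __init__(self, img_dim=512, speech_dim=128, text_dim=300,
                 hidden=256, num_classes=4):
        super().__init__()
        # Per-modality encoders (placeholders).
        self.img_enc = nn.Linear(img_dim, hidden)
        self.speech_enc = nn.Linear(speech_dim, hidden)
        self.text_enc = nn.Linear(text_dim, hidden)

        # Early fusion: classify from the concatenated features.
        self.early_head = nn.Linear(3 * hidden, num_classes)
        # Late fusion: one classifier head per modality.
        self.img_head = nn.Linear(hidden, num_classes)
        self.speech_head = nn.Linear(hidden, num_classes)
        self.text_head = nn.Linear(hidden, num_classes)

        # Learnable weights over the four intermediate outputs; the softmax
        # below keeps them positive and summing to one (an assumption about
        # how the "automatically adjusted" weighted average is realized).
        self.fusion_weights = nn.Parameter(torch.ones(4))

    def forward(self, img, speech, text):
        hi = F.relu(self.img_enc(img))
        hs = F.relu(self.speech_enc(speech))
        ht = F.relu(self.text_enc(text))

        logits = torch.stack([
            self.early_head(torch.cat([hi, hs, ht], dim=-1)),  # early fusion
            self.img_head(hi),                                  # late fusion
            self.speech_head(hs),
            self.text_head(ht),
        ], dim=0)                                               # (4, B, C)

        w = F.softmax(self.fusion_weights, dim=0).view(4, 1, 1)
        return (w * logits).sum(dim=0)                          # weighted average


# Usage: random tensors stand in for image, speech, and text features.
model = HybridFusionNet()
out = model(torch.randn(2, 512), torch.randn(2, 128), torch.randn(2, 300))
print(out.shape)  # torch.Size([2, 4]) -> logits over 4 emotion classes
```

Because the fusion weights are ordinary parameters, they are updated by backpropagation along with the encoders and heads, which is one plausible reading of the abstract's claim that the weights of the intermediate outputs are adjusted automatically.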
