Speech Emotion Recognition

98 papers with code • 14 benchmarks • 18 datasets

Speech Emotion Recognition is a task of speech processing and computational paralinguistics that aims to recognize and categorize the emotions expressed in spoken language. The goal is to determine the emotional state of a speaker, such as happiness, anger, sadness, or frustration, from their speech patterns, such as prosody, pitch, and rhythm.
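
As a rough illustration of the task (a minimal sketch of a classical SER pipeline, not any specific benchmark entry; it assumes librosa and scikit-learn are installed, and the file names and labels below are placeholders), hand-crafted features such as MFCCs can be pooled over time and fed to a simple classifier:

```python
# Minimal sketch of a classical SER pipeline: MFCC features + SVM classifier.
# The (path, label) pairs below are placeholders; replace them with a real corpus.
import librosa
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

def extract_features(wav_path, sr=16000, n_mfcc=40):
    """Load audio and summarise MFCCs over time (mean + std per coefficient)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

dataset = [("angry_001.wav", "anger"), ("happy_001.wav", "happiness")]  # placeholder
X = np.stack([extract_features(path) for path, _ in dataset])
y = np.array([label for _, label in dataset])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```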

For multimodal emotion recognition, please submit your results to the Multimodal Emotion Recognition on IEMOCAP benchmark.

Latest papers with no code

Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer

no code yet • 26 Mar 2024

In this paper, we propose a method to improve the accuracy of speech emotion recognition (SER) by using vision transformer (ViT) to attend to the correlation of frequency (y-axis) with time (x-axis) in spectrogram and transferring positional information between ViT through knowledge transfer.
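
For context, the generic "spectrogram as an image into a ViT" idea can be sketched as follows. This is not the paper's knowledge-transfer setup; it only shows how patches over the frequency and time axes reach self-attention, and it assumes torch, torchaudio, and timm with illustrative parameter choices:

```python
# Generic sketch: log-mel spectrogram fed to a Vision Transformer for emotion classification.
import torch
import torchaudio
import timm

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=1024, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, 16000 * 3)              # placeholder for a 3-second utterance
spec = to_db(mel(waveform)).unsqueeze(0)          # (batch=1, channel=1, n_mels, time)
spec = torch.nn.functional.interpolate(spec, size=(224, 224), mode="bilinear")

# Single-channel ViT: each patch spans both frequency (y) and time (x), so
# self-attention can relate frequency bands across time positions.
vit = timm.create_model("vit_base_patch16_224", pretrained=False, in_chans=1, num_classes=4)
logits = vit(spec)                                # emotion logits, e.g. 4 classes
print(logits.shape)
```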

emoDARTS: Joint Optimisation of CNN & Sequential Neural Network Architectures for Superior Speech Emotion Recognition

no code yet • 21 Mar 2024

This study presents emoDARTS, a DARTS-optimised joint CNN and Sequential Neural Network (SeqNN: LSTM, RNN) architecture that enhances SER performance.
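
The joint CNN-plus-sequence-model design that DARTS searches over can be illustrated with a hand-written PyTorch sketch (not a DARTS-optimised cell; layer sizes are illustrative):

```python
# Generic CNN + LSTM SER model: convolutional features per time step, then a recurrent summary.
import torch
import torch.nn as nn

class CNNLSTMSER(nn.Module):
    def __init__(self, n_mels=64, hidden=128, num_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(input_size=64 * (n_mels // 4), hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, spec):                      # spec: (batch, 1, n_mels, time)
        feats = self.cnn(spec)                    # (batch, 64, n_mels/4, time/4)
        b, c, f, t = feats.shape
        feats = feats.permute(0, 3, 1, 2).reshape(b, t, c * f)   # one vector per time step
        _, (h, _) = self.lstm(feats)
        return self.head(h[-1])                   # emotion logits

logits = CNNLSTMSER()(torch.randn(2, 1, 64, 200))
print(logits.shape)                               # torch.Size([2, 4])
```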

The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data

no code yet • 21 Mar 2024

In this short white paper, to encourage researchers with limited access to large-datasets, the organizers first outline several open-source datasets that are available to the community, and for the duration of the workshop are making several propriety datasets available.

Speech emotion recognition from voice messages recorded in the wild

no code yet • 4 Mar 2024

The pre-trained Unispeech-L model and its combination with eGeMAPS achieved the highest results, with 61.64% and 55.57% Unweighted Accuracy (UA) for 3-class valence and arousal prediction respectively, a 10% improvement over baseline models.
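
Unweighted Accuracy, the metric quoted above and common in SER because of class imbalance, is the mean of the per-class recalls, so minority emotion classes count as much as majority ones. A small sketch with scikit-learn (labels are illustrative):

```python
# Unweighted Accuracy (UA) = macro-averaged recall over the emotion classes.
from sklearn.metrics import recall_score

y_true = ["neutral", "positive", "negative", "negative", "positive", "neutral"]
y_pred = ["neutral", "positive", "negative", "positive", "positive", "negative"]

ua = recall_score(y_true, y_pred, average="macro")   # macro recall == unweighted accuracy
print(f"UA = {ua:.2%}")
```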

SEGAA: A Unified Approach to Predicting Age, Gender, and Emotion in Speech

no code yet • 1 Mar 2024

This paper explores deep learning models for these predictions, comparing the single, multi-output, and sequential models it highlights.
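
A multi-output setup of the kind compared here can be sketched as a shared encoder with separate heads for age, gender, and emotion. This is illustrative only, not the SEGAA architecture from the paper:

```python
# Shared encoder, three task-specific heads: age (regression), gender, emotion.
import torch
import torch.nn as nn

class MultiOutputSER(nn.Module):
    def __init__(self, feat_dim=80, hidden=128, num_emotions=4):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.age_head = nn.Linear(hidden, 1)             # regression
        self.gender_head = nn.Linear(hidden, 2)          # binary classification
        self.emotion_head = nn.Linear(hidden, num_emotions)

    def forward(self, x):                                # x: (batch, time, feat_dim)
        _, h = self.encoder(x)
        h = h[-1]                                        # final hidden state
        return self.age_head(h), self.gender_head(h), self.emotion_head(h)

age, gender, emotion = MultiOutputSER()(torch.randn(2, 100, 80))
# Training would sum the three losses, e.g. MSE for age + cross-entropy for gender and emotion.
```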

Mixer is more than just a model

no code yet • 28 Feb 2024

In the field of computer vision, MLP-Mixer is noted for its ability to extract data information from both channel and token perspectives, effectively acting as a fusion of channel and token information.
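
The two mixing steps the entry refers to are easy to see in a minimal MLP-Mixer block: one MLP mixes across tokens (positions/patches), the other across channels (features). A short PyTorch sketch with illustrative sizes:

```python
# Minimal MLP-Mixer block: token mixing followed by channel mixing, each with a residual.
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_tokens, dim, token_hidden=256, channel_hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                        # x: (batch, num_tokens, dim)
        # Token mixing: transpose so the MLP runs over the token axis.
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Channel mixing: MLP over the feature axis of each token.
        return x + self.channel_mlp(self.norm2(x))

out = MixerBlock(num_tokens=196, dim=128)(torch.randn(2, 196, 128))
print(out.shape)
```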

Parameter Efficient Finetuning for Speech Emotion Recognition and Domain Adaptation

no code yet • 19 Feb 2024

Foundation models have shown superior performance for speech emotion recognition (SER).
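
One widely used parameter-efficient fine-tuning recipe is adding LoRA adapters to a pre-trained speech foundation model. The sketch below assumes the Hugging Face transformers and peft libraries; the base checkpoint, target module names, and hyperparameters are illustrative, and this is not necessarily the configuration studied in the paper:

```python
# LoRA adapters on a pre-trained speech model for emotion classification.
from transformers import AutoModelForAudioClassification
from peft import LoraConfig, get_peft_model

model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=4)           # 4 emotion classes as an example

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"])              # attention projections in the encoder

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()                    # only the adapters (and head) are trained
```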

Persian Speech Emotion Recognition by Fine-Tuning Transformers

no code yet • 11 Feb 2024

Despite extensive discussions and global-scale efforts to enhance these systems, the application of this innovative and effective approach has received less attention in the context of Persian speech emotion recognition.

CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using Cochlear Cepstrum-based Masking for Speech Emotion Recognition

no code yet • 10 Feb 2024

Self-supervised learning (SSL) for recognizing the emotional content of speech can be heavily degraded by the presence of noise, which hampers modeling of the intricate temporal and spectral structures of speech.

Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition

no code yet • 4 Feb 2024

Through a comparative experiment and a layer-wise accuracy analysis on two distinct corpora, IEMOCAP and ESD, we explore differences between AWEs and raw self-supervised representations, as well as the proper utilization of AWEs alone and in combination with word embeddings.
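
A layer-wise analysis of this kind typically extracts hidden states from every layer of a self-supervised model and trains one simple probe per layer. The sketch below covers only the raw self-supervised representations (not the acoustic word embedding construction) and assumes the Hugging Face transformers library, with the checkpoint and mean-pooling choice as illustrative assumptions:

```python
# Layer-wise probing sketch: one utterance-level embedding per encoder layer.
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base", output_hidden_states=True)
model.eval()

waveform = torch.randn(1, 16000 * 3)                     # placeholder 3-second utterance
with torch.no_grad():
    hidden_states = model(waveform).hidden_states        # tuple: one tensor per layer

# Mean-pool each layer over time; each pooled embedding can then feed its own
# probe (e.g. logistic regression) to measure per-layer emotion accuracy.
layer_embeddings = [h.mean(dim=1) for h in hidden_states]
print(len(layer_embeddings), layer_embeddings[0].shape)
```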