Speaker Normalization for Self-supervised Speech Emotion Recognition

2 Feb 2022 · Itai Gat, Hagai Aronowitz, Weizhong Zhu, Edmilson Morais, Ron Hoory

Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversarial learning framework that learns a speech emotion recognition task while normalizing speaker characteristics out of the feature representation. We demonstrate the efficacy of our method in both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.
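The core idea of gradient-based adversarial speaker normalization can be sketched as follows: a speaker classifier is attached to the shared feature encoder, and its gradient is reversed (and scaled by a weight λ) before reaching the encoder, so the encoder learns features that are useful for emotion recognition but uninformative about speaker identity. The toy functions below are a minimal illustration of that gradient flow, not the authors' implementation; the names `reverse_gradient` and `combined_feature_gradient` and the scalar `lam` are hypothetical.

```python
# Toy sketch of gradient reversal for speaker normalization.
# Hypothetical illustration; not the paper's actual code.

def reverse_gradient(grad, lam):
    """Backward pass of a gradient-reversal layer: negate and
    scale the speaker-classifier gradient by lam before it
    reaches the shared encoder."""
    return [-lam * g for g in grad]

def combined_feature_gradient(grad_emotion, grad_speaker, lam):
    """Gradient the shared encoder receives: the emotion-task
    gradient plus the reversed speaker gradient. Descending this
    direction improves emotion prediction while removing
    speaker information from the features."""
    reversed_spk = reverse_gradient(grad_speaker, lam)
    return [ge + gr for ge, gr in zip(grad_emotion, reversed_spk)]

# Example: per-feature gradients from the two heads.
grad_emotion = [1.0, 2.0]
grad_speaker = [0.5, -1.0]
update = combined_feature_gradient(grad_emotion, grad_speaker, lam=1.0)
```

In a real framework (e.g. PyTorch) the same effect is typically implemented as an identity forward pass whose backward pass multiplies incoming gradients by -λ, placed between the encoder and the speaker classifier.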


Datasets

IEMOCAP

Results from the Paper


Task                       | Dataset | Model              | Metric | Value | Global Rank
Speech Emotion Recognition | IEMOCAP | TAP (Low Resource) | AUC    | 0.649 | # 1
Speech Emotion Recognition | IEMOCAP | TAP (5-fold)       | WA     | 0.742 | # 9
Speech Emotion Recognition | IEMOCAP | TAP                | WA     | 0.81  | # 3
