MSP-Podcast (A large naturalistic speech emotional dataset)

The MSP-Podcast corpus contains speech segments from podcast recordings that are perceptually annotated using crowdsourcing. The collection of this corpus is an ongoing process. Version 1.7 of the corpus has 62,140 speaking turns (100 hours).

Key features of this corpus:

We download audio recordings that are available under common licenses. We use only podcasts with less restrictive licenses, so we can modify, sell, and distribute the corpus (you can use it in commercial products!).
Most of the segments in a regular podcast are neutral. We use machine learning techniques trained with available data to retrieve candidate segments. These segments are then emotionally annotated with crowdsourcing. This approach allows us to spend our resources on speech segments that are likely to convey emotions.
We annotate categorical emotions and attribute-based labels at the speaking-turn level.
This is an ongoing effort; we currently have 62,140 speaking turns (100 hours). We collect approximately 10,000-13,000 new speaking turns per year. Our goal is to reach 400 hours.
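
The retrieval idea in the key features (score segments with a trained model, then send only the likely emotional ones to annotators) can be illustrated with a minimal Python sketch. This is not the corpus authors' pipeline: the Segment class, the arousal and valence scores in [0, 1], the neutral point of 0.5, and the select_candidates helper are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segment_id: str
    # Hypothetical model predictions in [0, 1]; 0.5 is assumed to mean "neutral".
    arousal: float
    valence: float


def emotional_salience(seg: Segment) -> float:
    """Largest deviation of either predicted attribute from the assumed neutral point."""
    return max(abs(seg.arousal - 0.5), abs(seg.valence - 0.5))


def select_candidates(segments, budget):
    """Return the `budget` segments predicted to be least neutral."""
    ranked = sorted(segments, key=emotional_salience, reverse=True)
    return ranked[:budget]


# Toy predictions standing in for the output of a pretrained emotion model.
segments = [
    Segment("a", arousal=0.52, valence=0.49),
    Segment("b", arousal=0.90, valence=0.20),
    Segment("c", arousal=0.50, valence=0.55),
    Segment("d", arousal=0.15, valence=0.60),
]

selected = select_candidates(segments, budget=2)
print([s.segment_id for s in selected])  # ['b', 'd']
```

With a fixed annotation budget, ranking by predicted deviation from neutral sends the segments most likely to convey emotion to the crowd workers first, which is the resource argument made above.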

Benchmarks

Task	Dataset Variant	Best Model
Speech Emotion Recognition	MSP-Podcast (Valence)	w2v2-L-robust-12
Speech Emotion Recognition	MSP-Podcast (Activation)	w2v2-L-robust-12
Speech Emotion Recognition	MSP-Podcast (Dominance)	w2v2-L-robust-12
Emotion Recognition	MSP-Podcast	w2v2-L-robust-12