…The segments vary in length from 3 to 10 seconds, and in each clip the only visible face in the video and the only audible voice in the soundtrack belong to a single speaker. In total, the dataset contains roughly 4,700 hours of video segments from approximately 150,000 distinct speakers, spanning a wide variety of people, languages, and face poses.
36 PAPERS • NO BENCHMARKS YET
CH-SIMS is a Chinese single- and multimodal sentiment analysis dataset containing 2,281 refined video segments collected in the wild, each with both multimodal and independent unimodal annotations.
15 PAPERS • 1 BENCHMARK
…accompaniment and the singing voice recorded as the left and right channels, respectively. Manual annotations include pitch contours in semitones, indices and types of unvoiced frames, lyrics, and vocal/non-vocal segments.
20 PAPERS • NO BENCHMARKS YET
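Since the accompaniment and the singing voice sit in separate stereo channels, the two sources can be recovered by de-interleaving the samples. A minimal sketch using only the Python standard library, assuming a 16-bit stereo WAV clip laid out as described (the file name and the synthetic demo samples are illustrative, not real dataset audio):

```python
import struct
import wave

def split_channels(path):
    """Split a stereo WAV into (left, right) lists of 16-bit samples.

    For a clip with this layout, the left channel is the accompaniment
    and the right channel is the singing voice.
    """
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 2:
            raise ValueError("expected a stereo file")
        frames = wf.readframes(wf.getnframes())
    # Stereo WAV frames interleave samples as L, R, L, R, ...
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return list(samples[0::2]), list(samples[1::2])

# Build a tiny synthetic stereo clip to demonstrate the channel layout.
left = [100, 200, 300]     # stand-in for the accompaniment channel
right = [-100, -200, -300] # stand-in for the vocal channel
interleaved = [v for pair in zip(left, right) for v in pair]
with wave.open("demo_clip.wav", "wb") as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)      # 16-bit samples
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<%dh" % len(interleaved), *interleaved))

accompaniment, vocals = split_channels("demo_clip.wav")
```

Here `accompaniment` recovers the left-channel samples and `vocals` the right-channel samples, which is the typical first step before using the pitch and voicing annotations.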