…The segments vary in length from 3 to 10 seconds, and in each clip the only visible face in the video and the only audible voice in the soundtrack belong to a single speaker. In total, the dataset contains roughly 4,700 hours of video segments from approximately 150,000 distinct speakers, spanning a wide variety of people, languages, and face poses.
36 PAPERS • NO BENCHMARKS YET
CH-SIMS is a Chinese single- and multimodal sentiment analysis dataset containing 2,281 refined video segments collected in the wild, each with both multimodal and independent unimodal annotations.
15 PAPERS • 1 BENCHMARK
…accompaniment and the singing voice recorded as the left and right channels, respectively. Manual annotations include pitch contours in semitones, indices and types of unvoiced frames, lyrics, and vocal/non-vocal segments.
20 PAPERS • NO BENCHMARKS YET
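Since the accompaniment and the singing voice sit in separate stereo channels, the two sources can be recovered by de-interleaving the samples. A minimal sketch using only the Python standard library, assuming a 16-bit stereo WAV clip laid out as described (the file name and the synthetic demo samples are illustrative, not real dataset audio):

```python
import struct
import wave

def split_channels(path):
    """Split a stereo WAV into (left, right) lists of 16-bit samples.

    For a clip with this layout, the left channel is the accompaniment
    and the right channel is the singing voice.
    """
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 2:
            raise ValueError("expected a stereo file")
        frames = wf.readframes(wf.getnframes())
    # Stereo WAV frames interleave samples as L, R, L, R, ...
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return list(samples[0::2]), list(samples[1::2])

# Build a tiny synthetic stereo clip to demonstrate the channel layout.
left = [100, 200, 300]     # stand-in for the accompaniment channel
right = [-100, -200, -300] # stand-in for the vocal channel
interleaved = [v for pair in zip(left, right) for v in pair]
with wave.open("demo_clip.wav", "wb") as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)      # 16-bit samples
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<%dh" % len(interleaved), *interleaved))

accompaniment, vocals = split_channels("demo_clip.wav")
```

Here `accompaniment` recovers the left-channel samples and `vocals` the right-channel samples, which is the typical first step before using the pitch and voicing annotations.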