AVSBench is a pixel-level audio-visual segmentation benchmark that provides ground-truth labels for sounding objects. Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully-supervised audio-visual segmentation with multiple sound sources; 3) fully-supervised audio-visual semantic segmentation.
10 PAPERS • NO BENCHMARKS YET
The MSP-Podcast corpus contains speech segments from podcast recordings, perceptually annotated using crowdsourcing. The collection of this corpus is an ongoing process. Most of the segments in a typical podcast are neutral, so we use machine learning models trained on the available data to retrieve candidate segments, which are then emotionally annotated through crowdsourcing. This approach allows us to spend our resources on speech segments that are likely to convey emotions.
3 PAPERS • 4 BENCHMARKS
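A minimal sketch of that retrieval step, under assumptions of my own (logistic regression over generic acoustic features; the corpus description does not specify the exact model): score unannotated segments with a classifier trained on already-labeled data and send only the highest-scoring ones to annotation.

```python
# Illustrative sketch only: features, model, and the top-200 cutoff are
# assumptions, not the MSP-Podcast authors' actual pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(500, 40))        # features of annotated segments
y_labeled = rng.integers(0, 2, size=500)      # 1 = emotional, 0 = neutral
X_pool = rng.normal(size=(10_000, 40))        # unannotated podcast segments

clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
scores = clf.predict_proba(X_pool)[:, 1]      # P(emotional) per segment
candidates = np.argsort(scores)[::-1][:200]   # top 200 go to crowdsourcing
print(candidates[:10])
```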
TAU Urban Acoustic Scenes 2019 Mobile development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, metro station, pedestrian street, public square, street. Each acoustic scene has 1440 segments (240 minutes of audio) recorded with device A (the main device) and 108 segments of parallel audio (18 minutes) recorded with each of devices B and C.
5 PAPERS • 1 BENCHMARK
TAU Urban Acoustic Scenes 2019 development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, metro station, pedestrian street, public square, street. Each acoustic scene has 1440 segments (240 minutes of audio). The dataset contains 40 hours of audio in total.
13 PAPERS • 2 BENCHMARKS
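The quoted figures are internally consistent: 1440 segments × 10 s = 14,400 s = 240 min per scene, and 10 scenes × 4 h = 40 h. A minimal sketch for sanity-checking a local copy, assuming a hypothetical flat audio/ directory of WAV clips whose file names begin with the scene label (the real package layout may differ):

```python
# Sketch only: assumes one 10-second WAV file per segment under audio/,
# named "<scene>-....wav"; adapt to the actual download layout.
from pathlib import Path
from collections import Counter

SEGMENT_SECONDS = 10
EXPECTED_PER_SCENE = 1440          # 1440 * 10 s = 240 min per scene

counts = Counter(p.name.split("-")[0] for p in Path("audio").glob("*.wav"))
for scene, n in sorted(counts.items()):
    status = "ok" if n == EXPECTED_PER_SCENE else "INCOMPLETE"
    print(f"{scene}: {n} segments ({n * SEGMENT_SECONDS / 60:.0f} min) {status}")

total_hours = sum(counts.values()) * SEGMENT_SECONDS / 3600
print(f"total: {total_hours:.1f} h")   # expected: 10 scenes * 4 h = 40 h
```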
…The segments vary in length between 3 and 10 seconds, and in each clip the only visible face in the video and the only audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 4700 hours of video segments with approximately 150,000 distinct speakers, spanning a wide variety of people, languages, and face poses.
35 PAPERS • NO BENCHMARKS YET
…The dataset consists of the same songs split into 3,223 acoustically homogeneous segments of 3 to 16 seconds. The tag labels are annotated at the segment level rather than the track level.
1 PAPER • NO BENCHMARKS YET
…respiratory flow ranging from 180 to 240 L/min. Each audio recording was sampled at an 8 kHz sampling frequency as a mono-channel WAV file at 8-bit depth. The audio recordings were segmented; the obtained segments (of non-mixed states) were of variable length and, for some methods, were further split into fixed-length frames for feature extraction. The constructed database consisted of 193 drug actuation segments, 319 inhalation segments, 620 exhalation segments, and 505 noise segments, ready to be used for audio sound recognition with different sets of features.
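The fixed-length framing mentioned above is standard audio preprocessing; a minimal sketch, with illustrative frame and hop sizes (the description does not state the actual values):

```python
# Sketch only: 25 ms frames with 10 ms hop are common defaults, not
# values taken from the dataset description.
import numpy as np

def frame_segment(signal: np.ndarray, sr: int = 8000,
                  frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Return a (num_frames, frame_len) array of overlapping frames."""
    frame_len = int(sr * frame_ms / 1000)   # 200 samples at 8 kHz
    hop_len = int(sr * hop_ms / 1000)       # 80 samples at 8 kHz
    num_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(num_frames)[:, None]
    return signal[idx]

# e.g. a 2-second inhalation segment at 8 kHz -> (198, 200) frames
segment = np.random.randn(16000)
print(frame_segment(segment).shape)
```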
CH-SIMS is a Chinese single- and multimodal sentiment analysis dataset which contains 2,281 refined video segments in the wild with both multimodal and independent unimodal annotations.
13 PAPERS • 1 BENCHMARK
EPIC-SOUNDS includes 78.4k categorised and 39.2k non-categorised segments of audible events and actions, distributed across 44 classes.
7 PAPERS • 2 BENCHMARKS
…Each segment is annotated for the presence of 11 emotions (angry, neutral, fear, happy, sad, disappointed, bored, disgusted, excited, surprised, and other).
…lyrics encode an important part of the semantics of a song; the authors focus on describing the methods they propose to extract relevant information from the lyrics, such as their structure. This structure segmentation can be exploited by music search engines and music professionals (e.g., journalists, radio presenters) to better handle large collections of lyrics, allowing intelligent browsing, categorization, and segmentation.
0 PAPERS • NO BENCHMARKS YET
…Each segment is annotated for the presence of 9 emotions (angry, excited, fear, sad, surprised, frustrated, happy, disappointed and neutral) as well as valence, arousal and dominance.
629 PAPERS • 3 BENCHMARKS
We present YTSeg, a topically and structurally diverse benchmark for the text segmentation task based on YouTube transcriptions.
1 PAPER • 2 BENCHMARKS
…The dataset contains 6,892 segment-level summarization instances for training and evaluating performance.
7 PAPERS • NO BENCHMARKS YET
…Segments of each song are annotated as “voice” (sung or spoken) or “no-voice”. The songs constitute a total of about 6 hours of music.
3 PAPERS • NO BENCHMARKS YET
…Since the dataset is collected ‘in the wild’, the speech segments are corrupted with real world noise including laughter, cross-talk, channel effects, music and other sounds.
491 PAPERS • 5 BENCHMARKS
…This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data.
52 PAPERS • 5 BENCHMARKS
…accompaniment and the singing voice recorded as the left and right channels, respectively. Manual annotations include pitch contours in semitones, indices and types for unvoiced frames, lyrics, and vocal/non-vocal segments.
20 PAPERS • NO BENCHMARKS YET
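Because the two stems are stored as the two channels of a single stereo file, the ground-truth sources can be recovered by splitting channels; a minimal sketch (the file names are hypothetical):

```python
# Sketch only: assumes the left/right channel convention stated above;
# "clip.wav" is a placeholder for an actual file from the dataset.
from scipy.io import wavfile

rate, stereo = wavfile.read("clip.wav")   # stereo shape: (num_samples, 2)
accompaniment = stereo[:, 0]              # left channel
vocals = stereo[:, 1]                     # right channel

wavfile.write("accompaniment.wav", rate, accompaniment)
wavfile.write("vocals.wav", rate, vocals)
```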
…In order to ease automatic speech segmentation, we carried out the recordings in an anechoic room whose walls are covered by sound-absorbing materials.
…Additionally, the complete TextGrid files containing the segmentation information for those sessions are included. The size of the uncompressed dataset is 15 GB.
5 PAPERS • NO BENCHMARKS YET
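A minimal sketch of reading the segmentation intervals from one of the included TextGrid files, using the third-party textgrid package (pip install textgrid); the file name is hypothetical and the tier layout is an assumption:

```python
# Sketch only: assumes the `textgrid` package and a placeholder file name;
# the dataset's actual tier names and structure may differ.
import textgrid

tg = textgrid.TextGrid.fromFile("session01.TextGrid")
for tier in tg.tiers:
    if isinstance(tier, textgrid.IntervalTier):
        for interval in tier:
            if interval.mark:  # skip unlabeled stretches
                print(f"{tier.name}: {interval.minTime:.2f}-"
                      f"{interval.maxTime:.2f} s  {interval.mark}")
```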
…This (FS-02) edition of the FEARLESS STEPS Challenge includes the following 6 tasks: TASK 1: Speech Activity Detection (SAD); TASK 2: Speaker Identification (using Speaker Segments); … Track 2: ASR using Diarized Segments (ASR_track2).
…Mixtures were created by randomly selecting an event instance and, from it, randomly extracting a segment 3-15 seconds in length. Between events, a silent region of random length was introduced.
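A minimal sketch of that synthesis procedure; the sampling rate, event count, and silence-length range are assumptions, since the description does not state them:

```python
# Sketch only: sr=44100, n_events=3, and 0.5-3 s silences are placeholder
# choices, not values from the dataset description.
import random
import numpy as np

def make_mixture(events: list, sr: int = 44100, n_events: int = 3) -> np.ndarray:
    """Concatenate random 3-15 s excerpts of random events,
    separated by random-length silences."""
    pieces = []
    for _ in range(n_events):
        event = random.choice(events)                # random event instance
        seg_len = min(int(random.uniform(3, 15) * sr), len(event))
        start = random.randrange(len(event) - seg_len + 1)
        pieces.append(event[start:start + seg_len])  # 3-15 s excerpt
        pieces.append(np.zeros(int(random.uniform(0.5, 3.0) * sr)))  # silence
    return np.concatenate(pieces[:-1])               # drop trailing silence

# e.g. three synthetic 20-second "events" -> one mixture
events = [np.random.randn(20 * 44100) for _ in range(3)]
print(make_mixture(events).shape)
```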
…"Multimodal analysis for identification and segmentation of moving-sounding objects."IEEE Transactions on Multimedia 15.2 (2013): 378-390. [3] Li, Kai, Jun Ye, and Kien A. Hua.