AVSBench is a pixel-level audio-visual segmentation benchmark that provides ground-truth labels for sounding objects. Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully-supervised audio-visual segmentation with multiple sound sources; 3) fully-supervised audio-visual semantic segmentation.
10 PAPERS • NO BENCHMARKS YET
The MSP-Podcast corpus contains speech segments from podcast recordings, perceptually annotated using crowdsourcing. The collection of this corpus is an ongoing process. Most of the segments in a typical podcast are neutral, so we use machine learning models trained on the available data to retrieve candidate segments, which are then emotionally annotated through crowdsourcing. This approach allows us to spend our resources on speech segments that are likely to convey emotions.
3 PAPERS • 4 BENCHMARKS
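A minimal sketch of that retrieval step, under assumptions of my own (logistic regression over generic acoustic features; the corpus description does not specify the exact model): score unannotated segments with a classifier trained on already-labeled data and send only the highest-scoring ones to annotation.

```python
# Illustrative sketch only: features, model, and the top-200 cutoff are
# assumptions, not the MSP-Podcast authors' actual pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(500, 40))        # features of annotated segments
y_labeled = rng.integers(0, 2, size=500)      # 1 = emotional, 0 = neutral
X_pool = rng.normal(size=(10_000, 40))        # unannotated podcast segments

clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
scores = clf.predict_proba(X_pool)[:, 1]      # P(emotional) per segment
candidates = np.argsort(scores)[::-1][:200]   # top 200 go to crowdsourcing
print(candidates[:10])
```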
TAU Urban Acoustic Scenes 2019 Mobile development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, metro station, pedestrian street, public square, street. Each acoustic scene has 1440 segments (240 minutes of audio) recorded with device A (the main device) and 108 segments of parallel audio (18 minutes) recorded with each of devices B and C.
5 PAPERS • 1 BENCHMARK
TAU Urban Acoustic Scenes 2019 development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, metro station, pedestrian street, public square, street. Each acoustic scene has 1440 segments (240 minutes of audio). The dataset contains 40 hours of audio in total.
13 PAPERS • 2 BENCHMARKS
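The quoted figures are internally consistent: 1440 segments × 10 s = 14,400 s = 240 min per scene, and 10 scenes × 4 h = 40 h. A minimal sketch for sanity-checking a local copy, assuming a hypothetical flat audio/ directory of WAV clips whose file names begin with the scene label (the real package layout may differ):

```python
# Sketch only: assumes one 10-second WAV file per segment under audio/,
# named "<scene>-....wav"; adapt to the actual download layout.
from pathlib import Path
from collections import Counter

SEGMENT_SECONDS = 10
EXPECTED_PER_SCENE = 1440          # 1440 * 10 s = 240 min per scene

counts = Counter(p.name.split("-")[0] for p in Path("audio").glob("*.wav"))
for scene, n in sorted(counts.items()):
    status = "ok" if n == EXPECTED_PER_SCENE else "INCOMPLETE"
    print(f"{scene}: {n} segments ({n * SEGMENT_SECONDS / 60:.0f} min) {status}")

total_hours = sum(counts.values()) * SEGMENT_SECONDS / 3600
print(f"total: {total_hours:.1f} h")   # expected: 10 scenes * 4 h = 40 h
```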
…The segments vary in length between 3 and 10 seconds, and in each clip the only visible face in the video and the only audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 4700 hours of video segments with approximately 150,000 distinct speakers, spanning a wide variety of people, languages, and face poses.
35 PAPERS • NO BENCHMARKS YET
…The dataset consists of the same songs split into 3,223 acoustically homogeneous segments of 3 to 16 seconds. The tag labels are annotated at the segment level rather than the track level.
1 PAPER • NO BENCHMARKS YET
…respiratory flow ranging from 180 to 240 L/min. Each audio recording was sampled at an 8 kHz sampling frequency as a mono-channel WAV file at 8-bit depth. The audio recordings were segmented; the obtained segments (of non-mixed states) were of variable length and, for some methods, were further split into fixed-length frames for feature extraction. The constructed database consisted of 193 drug actuation segments, 319 inhalation segments, 620 exhalation segments, and 505 noise segments, ready to be used for audio sound recognition with different sets of features.
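The fixed-length framing mentioned above is standard audio preprocessing; a minimal sketch, with illustrative frame and hop sizes (the description does not state the actual values):

```python
# Sketch only: 25 ms frames with 10 ms hop are common defaults, not
# values taken from the dataset description.
import numpy as np

def frame_segment(signal: np.ndarray, sr: int = 8000,
                  frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Return a (num_frames, frame_len) array of overlapping frames."""
    frame_len = int(sr * frame_ms / 1000)   # 200 samples at 8 kHz
    hop_len = int(sr * hop_ms / 1000)       # 80 samples at 8 kHz
    num_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(num_frames)[:, None]
    return signal[idx]

# e.g. a 2-second inhalation segment at 8 kHz -> (198, 200) frames
segment = np.random.randn(16000)
print(frame_segment(segment).shape)
```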
CH-SIMS is a Chinese single- and multimodal sentiment analysis dataset which contains 2,281 refined video segments in the wild with both multimodal and independent unimodal annotations.
13 PAPERS • 1 BENCHMARK
EPIC-SOUNDS includes 78.4k categorised and 39.2k non-categorised segments of audible events and actions, distributed across 44 classes.
7 PAPERS • 2 BENCHMARKS
…Each segment is annotated for the presence of 11 emotions (angry, neutral, fear, happy, sad, disappointed, bored, disgusted, excited, surprised, and other).
…lyrics encode an important part of the semantics of a song; the authors focus on describing the methods they propose to extract relevant information from the lyrics, such as their structure. This structure segmentation can be exploited by music search engines and music professionals (e.g., journalists, radio presenters) to better handle large collections of lyrics, allowing intelligent browsing, categorization, and segmentation.
0 PAPERS • NO BENCHMARKS YET
…Each segment is annotated for the presence of 9 emotions (angry, excited, fear, sad, surprised, frustrated, happy, disappointed and neutral) as well as valence, arousal and dominance.
629 PAPERS • 3 BENCHMARKS
We present YTSeg, a topically and structurally diverse benchmark for the text segmentation task based on YouTube transcriptions.
1 PAPER • 2 BENCHMARKS
…The dataset contains 6,892 segment-level summarization instances for training and evaluating performance.
7 PAPERS • NO BENCHMARKS YET
…Segments of each song are annotated as “voice” (sung or spoken) or “no-voice”. The songs constitute a total of about 6 hours of music.
3 PAPERS • NO BENCHMARKS YET
…Since the dataset is collected ‘in the wild’, the speech segments are corrupted with real world noise including laughter, cross-talk, channel effects, music and other sounds.
491 PAPERS • 5 BENCHMARKS
…This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data.
52 PAPERS • 5 BENCHMARKS
…accompaniment and the singing voice recorded as the left and right channels, respectively. Manual annotations include pitch contours in semitones, indices and types for unvoiced frames, lyrics, and vocal/non-vocal segments.
20 PAPERS • NO BENCHMARKS YET
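Because the two stems are stored as the two channels of a single stereo file, the ground-truth sources can be recovered by splitting channels; a minimal sketch (the file names are hypothetical):

```python
# Sketch only: assumes the left/right channel convention stated above;
# "clip.wav" is a placeholder for an actual file from the dataset.
from scipy.io import wavfile

rate, stereo = wavfile.read("clip.wav")   # stereo shape: (num_samples, 2)
accompaniment = stereo[:, 0]              # left channel
vocals = stereo[:, 1]                     # right channel

wavfile.write("accompaniment.wav", rate, accompaniment)
wavfile.write("vocals.wav", rate, vocals)
```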
…In order to ease automatic speech segmentation, we carried out the recordings in an anechoic room whose walls are covered by sound-absorbing materials.
…Additionally, the complete TextGrid files containing the segmentation information for those sessions are included. The size of the uncompressed dataset is 15 GB.
5 PAPERS • NO BENCHMARKS YET
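A minimal sketch of reading the segmentation intervals from one of the included TextGrid files, using the third-party textgrid package (pip install textgrid); the file name is hypothetical and the tier layout is an assumption:

```python
# Sketch only: assumes the `textgrid` package and a placeholder file name;
# the dataset's actual tier names and structure may differ.
import textgrid

tg = textgrid.TextGrid.fromFile("session01.TextGrid")
for tier in tg.tiers:
    if isinstance(tier, textgrid.IntervalTier):
        for interval in tier:
            if interval.mark:  # skip unlabeled stretches
                print(f"{tier.name}: {interval.minTime:.2f}-"
                      f"{interval.maxTime:.2f} s  {interval.mark}")
```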
…This (FS-02) edition of the FEARLESS STEPS Challenge includes the following 6 tasks: TASK 1: Speech Activity Detection (SAD); TASK 2: Speaker Identification (using Speaker Segments); … Track 2: ASR using Diarized Segments (ASR_track2).
…Mixtures were created by randomly selecting an event instance and, from it, randomly extracting a segment 3-15 seconds in length. Between events, a silent region of random length was introduced.
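A minimal sketch of that synthesis procedure; the sampling rate, event count, and silence-length range are assumptions, since the description does not state them:

```python
# Sketch only: sr=44100, n_events=3, and 0.5-3 s silences are placeholder
# choices, not values from the dataset description.
import random
import numpy as np

def make_mixture(events: list, sr: int = 44100, n_events: int = 3) -> np.ndarray:
    """Concatenate random 3-15 s excerpts of random events,
    separated by random-length silences."""
    pieces = []
    for _ in range(n_events):
        event = random.choice(events)                # random event instance
        seg_len = min(int(random.uniform(3, 15) * sr), len(event))
        start = random.randrange(len(event) - seg_len + 1)
        pieces.append(event[start:start + seg_len])  # 3-15 s excerpt
        pieces.append(np.zeros(int(random.uniform(0.5, 3.0) * sr)))  # silence
    return np.concatenate(pieces[:-1])               # drop trailing silence

# e.g. three synthetic 20-second "events" -> one mixture
events = [np.random.randn(20 * 44100) for _ in range(3)]
print(make_mixture(events).shape)
```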
…"Multimodal analysis for identification and segmentation of moving-sounding objects."IEEE Transactions on Multimedia 15.2 (2013): 378-390. [3] Li, Kai, Jun Ye, and Kien A. Hua.