Co-Separating Sounds of Visual Objects

ICCV 2019 · Ruohan Gao, Kristen Grauman

Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video clips, but this puts unwieldy restrictions on training data collection and may even prevent learning the properties of "true" mixed sounds. We introduce a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos. Our novel training objective requires that the deep neural network's separated audio for similar-looking objects be consistently identifiable, while simultaneously reproducing accurate video-level audio tracks for each source training pair. Our approach disentangles sounds in realistic test videos, even in cases where an object was not observed individually during training. We obtain state-of-the-art results on visually-guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets.
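The two-part objective described above can be illustrated with a minimal NumPy sketch. The abstract does not give the exact loss forms, so everything here is an assumption made for illustration: the function names, the representation of separated sounds as spectrogram masks, the L1 distance for the video-level reconstruction term, and the cross-entropy for the "consistently identifiable" term. It is not the authors' implementation.

```python
import numpy as np

def co_separation_loss(pred_masks, video_ids, target_masks):
    """Video-level term (assumed L1): the per-object masks predicted for the
    objects of one video should sum to that video's mask in the mixture.

    pred_masks:   (num_objects, F, T) masks predicted on the mixed spectrogram
    video_ids:    (num_objects,) index of the source video of each object
    target_masks: (num_videos, F, T) target mask of each video's audio
    """
    num_videos = target_masks.shape[0]
    loss = 0.0
    for v in range(num_videos):
        # Sum the masks of all objects detected in video v.
        video_mask = pred_masks[video_ids == v].sum(axis=0)
        loss += np.abs(video_mask - target_masks[v]).mean()
    return loss / num_videos

def object_consistency_loss(class_logits, object_labels):
    """Identifiability term (assumed cross-entropy): a classifier applied to
    each separated sound should predict the category of the visual object.

    class_logits:  (num_objects, num_classes) classifier scores per separated sound
    object_labels: (num_objects,) integer category of each visual object
    """
    # Numerically stable log-softmax.
    shifted = class_logits - class_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(object_labels)), object_labels].mean()
```

In a training loop the two terms would be combined, for example as a weighted sum; the weighting is likewise an assumption not stated in the abstract.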



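| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Audio Source Separation | AudioSet | Co-Separation | SAR | 13 | #1 |
| Audio Source Separation | AudioSet | Co-Separation | SDR | 4.26 | #2 |
| Audio Source Separation | AudioSet | Co-Separation | SIR | 7.07 | #1 |
| Audio Denoising | AV-Bench - Guitar Solo | Co-Separation | NSDR | 11.9 | #1 |
| Audio Denoising | AV-Bench - Violin Yanni | Co-Separation | NSDR | 8.53 | #1 |
| Audio Denoising | AV-Bench - Wooden Horse | Co-Separation | NSDR | 14.5 | #1 |
| Audio Source Separation | MUSIC (multi-source) | Co-Separation | SAR | 11.3 | #1 |
| Audio Source Separation | MUSIC (multi-source) | Co-Separation | SIR | 13.8 | #1 |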
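For readers unfamiliar with the metric names: SDR, SIR, and SAR are the signal-to-distortion, signal-to-interference, and signal-to-artifact ratios, usually reported in dB, and NSDR is commonly defined as the SDR gain of the separated signal over the unprocessed mixture. The sketch below uses a simplified energy-ratio SDR to show the idea. It is an assumption-laden illustration, not the BSS Eval decomposition that standard toolkits compute and not necessarily the exact evaluation protocol behind the numbers listed here. The function names are hypothetical.

```python
import numpy as np

def simple_sdr(reference, estimate, eps=1e-8):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    This treats all error as distortion; BSS Eval additionally splits the
    error into interference and artifact components (giving SIR and SAR).
    """
    error = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(error ** 2) + eps))

def nsdr(reference, estimate, mixture):
    """Normalized SDR: improvement of the estimate over the raw mixture."""
    return simple_sdr(reference, estimate) - simple_sdr(reference, mixture)
```

Higher is better for all four metrics; an NSDR of 0 dB would mean the separated signal is no closer to the reference than the mixture itself.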

