Room impulse response (RIR), which characterizes how sound propagates within an environment, is critical for synthesizing high-fidelity audio for that environment.
We propose DAVIS, a Diffusion model-based Audio-Visual Separation framework that solves the audio-visual sound source separation task in a generative manner.
To enable this, a variety of metrics to measure quality and intelligibility under different assumptions have been developed.
In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; 2) out-of-view sound components can arise as wearers shift their attention.
We propose an objective for perceptual quality based on temporal acoustic parameters.
We can add this criterion as an auxiliary loss to any model that produces speech, to optimize speech outputs to match the values of clean speech in these features.
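As a rough illustration of how such an auxiliary criterion might be wired into training, the sketch below uses spectral flux as a single differentiable stand-in for the temporal acoustic parameters; the function names, the 0.1 weight, and the choice of feature are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def spectral_flux(mag: torch.Tensor) -> torch.Tensor:
    """Mean squared frame-to-frame change of a magnitude spectrogram (batch, freq, time)."""
    diff = mag[..., 1:] - mag[..., :-1]
    return diff.pow(2).mean(dim=(-2, -1))

def acoustic_param_loss(enhanced_mag: torch.Tensor, clean_mag: torch.Tensor) -> torch.Tensor:
    """Auxiliary term: push the enhanced signal's acoustic parameter toward clean speech."""
    return F.l1_loss(spectral_flux(enhanced_mag), spectral_flux(clean_mag))

# Hypothetical combined objective: primary spectral loss plus the auxiliary term.
# total_loss = F.l1_loss(enhanced_mag, clean_mag) + 0.1 * acoustic_param_loss(enhanced_mag, clean_mag)
```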
Human perception of the complex world relies on a comprehensive analysis of multi-modal signals, and the co-occurrences of audio and video signals provide humans with rich cues.
By comparing complex- and real-valued versions of fundamental building blocks in the recently developed gated convolutional recurrent network (GCRN), we show how different mechanisms for basic blocks affect the performance.
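For context, one common way to build a complex-valued convolution out of real-valued blocks, the kind of building block such comparisons examine, is sketched below; this is a generic formulation, not necessarily the exact GCRN implementation.

```python
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex convolution from two real convolutions:
    (W_r + jW_i)(x_r + jx_i) = (W_r*x_r - W_i*x_i) + j(W_r*x_i + W_i*x_r)."""
    def __init__(self, in_ch, out_ch, kernel_size, **kwargs):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, **kwargs)
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, **kwargs)

    def forward(self, x_r, x_i):
        # Real and imaginary parts of the complex product.
        return (self.conv_r(x_r) - self.conv_i(x_i),
                self.conv_r(x_i) + self.conv_i(x_r))
```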
Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements.
During training, our approach augments a model learning complex spectral mapping with a temporary submodel to predict the covariance of the enhancement error at each time-frequency bin.
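A minimal sketch of the idea, simplified to a diagonal (per-bin variance) case rather than a full covariance: the temporary submodel outputs a log-variance map that weights the squared enhancement error as a Gaussian negative log-likelihood. The names and the diagonal simplification are assumptions for illustration.

```python
import torch

def per_bin_gaussian_nll(est: torch.Tensor, target: torch.Tensor,
                         log_var: torch.Tensor) -> torch.Tensor:
    """Gaussian NLL per time-frequency bin: bins with large predicted variance
    (i.e., expected enhancement error) contribute less to the squared-error term."""
    sq_err = (est - target) ** 2
    return (sq_err * torch.exp(-log_var) + log_var).mean()
```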
We first identify key acoustic parameters that have been found to correlate well with voice quality (e.g., jitter, shimmer, and spectral flux) and then propose objective functions aimed at reducing the difference between clean speech and enhanced speech with respect to these features.
Audio quality assessment is critical for assessing the perceptual realism of sounds.
Modern end-to-end speech recognition models show astonishing results in transcribing audio signals into written text.
RemixIT is based on a continuous self-training scheme in which a pre-trained teacher model on out-of-domain data infers estimated pseudo-target signals for in-domain mixtures.
This paper investigates the impact of head movements on audio-visual speech enhancement (AVSE).
Specifically, a separation teacher model is pre-trained on an out-of-domain dataset and is used to infer estimated target signals for a batch of in-domain mixtures.
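As a hedged sketch of this teacher-student loop (the full method additionally remixes the teacher's estimates into new bootstrapped mixtures, which is omitted here), a single self-training step might look as follows; the model and function names are placeholders.

```python
import torch
import torch.nn.functional as F

def self_training_step(teacher, student, in_domain_mix, optimizer):
    """One step: the frozen teacher produces pseudo-targets for in-domain
    mixtures, and the student is updated to match them."""
    with torch.no_grad():
        pseudo_targets = teacher(in_domain_mix)   # estimated target signals
    estimates = student(in_domain_mix)
    loss = F.l1_loss(estimates, pseudo_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```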
While self-supervised speech representation learning has been popular in the speech research community, very few works have comprehensively analyzed audio representation learning for non-speech audio tasks.
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.
We show that neural networks trained using our framework produce scores that correlate well with subjective mean opinion scores (MOS) and are also competitive with methods such as DNSMOS, which explicitly relies on MOS from humans for training networks.
Supervised speech enhancement relies on parallel databases of degraded speech signals and their clean reference signals during training.
Our stateful implementation for online separation leads to a minor drop in performance compared to the offline model; 0.8 dB for monaural inputs and 0.3 dB for binaural inputs while reaching a real-time factor of 0.65.
We also provide insights into the attributes of sound event representations that enable such efficient information transfer.
Subjective evaluations are critical for assessing the perceptual realism of sounds in audio-synthesis driven technologies like augmented and virtual reality.
Limiting the policy to this class reduces the problem to obtaining a queue switching policy at queue emptiness instants.
Automatic Speech Recognition (ASR) remains a challenging task, especially in low-data scenarios with few audio examples.
In addition, our approach effectively preserves the interaural cues, which improves the accuracy of sound localization.
An important problem in machine auditory perception is to recognize and detect sound events.
Recognizing sounds is a key aspect of computational audio scene analysis and machine perception.
Weakly supervised learning algorithms are critical for scaling audio event detection to several hundreds of sound categories.
In the last couple of years, weakly labeled learning for sound events has emerged as an exciting approach to audio event detection.
In this work, we first describe a CNN based approach for weakly supervised training of audio events.
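A compact sketch of this kind of weakly supervised setup: a CNN produces frame-level class scores that are pooled (here by simple averaging) into a clip-level prediction, so only recording-level labels are needed for training. The architecture and pooling choice below are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class WeakLabelCNN(nn.Module):
    """Frame-level CNN predictions pooled into a clip-level score so that
    only recording-level (weak) labels are required."""
    def __init__(self, n_classes: int, n_mels: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        # Collapse the remaining frequency axis to get per-frame class scores.
        self.classifier = nn.Conv2d(64, n_classes, kernel_size=(n_mels // 4, 1))

    def forward(self, mel):                                   # mel: (B, 1, n_mels, T)
        h = self.backbone(mel)
        frame_scores = torch.sigmoid(self.classifier(h)).squeeze(2)  # (B, C, T')
        clip_scores = frame_scores.mean(dim=-1)                      # weak-label prediction
        return clip_scores, frame_scores
```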
In this work we propose approaches to effectively transfer knowledge from weakly labeled web audio data.
The framework crawls videos using search queries corresponding to 78 sound event labels drawn from three datasets.
In this paper we propose a novel learning framework called Supervised and Weakly Supervised Learning where the goal is to learn simultaneously from weakly and strongly labeled data.
The audio event detectors are trained on the labeled audio and run on the unlabeled audio downloaded from YouTube.
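Under the assumption of a model like the one sketched earlier (with both frame-level and clip-level outputs), the joint objective of such a framework could combine a frame-level loss on strongly labeled clips with a clip-level loss on weakly labeled ones; the weighting and names here are hypothetical.

```python
import torch.nn.functional as F

def joint_strong_weak_loss(frame_scores, clip_scores,
                           strong_targets=None, weak_targets=None, weak_weight=1.0):
    """Frame-level BCE where strong (temporal) labels exist, clip-level BCE
    where only weak (recording-level) labels exist."""
    loss = 0.0
    if strong_targets is not None:          # (batch, classes, frames)
        loss = loss + F.binary_cross_entropy(frame_scores, strong_targets)
    if weak_targets is not None:            # (batch, classes)
        loss = loss + weak_weight * F.binary_cross_entropy(clip_scores, weak_targets)
    return loss
```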
In this paper we consider the problem of speech enhancement in real-world-like conditions where multiple noises can simultaneously corrupt speech.
This yields a more complete description of the recording and is notable because temporal information is not available in weakly labeled data in the first place.
We also introduce a novel metric for ranking instances, based on an index that depends on the rank of a test point's weighted score among the weighted scores of the training points.