no code implementations • 27 Sep 2023 • Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Room impulse response (RIR), which measures the sound propagation within an environment, is critical for synthesizing high-fidelity audio for a given environment.
no code implementations • 31 Jul 2023 • Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu
We propose DAVIS, a Diffusion model-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task in a generative manner.
no code implementations • 4 Apr 2023 • Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, Buye Xu
To enable this, a variety of metrics to measure quality and intelligibility under different assumptions have been developed.
1 code implementation • CVPR 2023 • Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; and 2) out-of-view sound components can be created as wearers shift their attention.
2 code implementations • 16 Feb 2023 • Yunyang Zeng, Joseph Konan, Shuo Han, David Bick, Muqiao Yang, Anurag Kumar, Shinji Watanabe, Bhiksha Raj
We propose an objective for perceptual quality based on temporal acoustic parameters.
2 code implementations • 16 Feb 2023 • Muqiao Yang, Joseph Konan, David Bick, Yunyang Zeng, Shuo Han, Anurag Kumar, Shinji Watanabe, Bhiksha Raj
We can add this criterion as an auxiliary loss to any model that produces speech, to optimize speech outputs to match the values of clean speech in these features.
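As a minimal sketch of how such an auxiliary criterion could be attached to a primary loss, the snippet below compares acoustic-parameter vectors of enhanced and clean speech with an L1 distance and adds the result to a main enhancement loss. The function names, the L1 distance, and the weighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def acoustic_param_loss(enhanced_feats, clean_feats, weight=1.0):
    """L1 distance between acoustic-parameter vectors (hypothetical
    feature extraction assumed upstream) of enhanced and clean speech,
    usable as an auxiliary term on top of a main loss."""
    enhanced_feats = np.asarray(enhanced_feats, dtype=float)
    clean_feats = np.asarray(clean_feats, dtype=float)
    return weight * float(np.mean(np.abs(enhanced_feats - clean_feats)))

def total_loss(main_loss, enhanced_feats, clean_feats, aux_weight=0.1):
    """Primary enhancement loss plus the auxiliary acoustic-parameter
    term -- the criterion can be added to any model that produces speech."""
    return main_loss + acoustic_param_loss(enhanced_feats, clean_feats, aux_weight)
```

When the enhanced features already match the clean features, the auxiliary term vanishes and the total loss reduces to the primary loss.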
no code implementations • 4 Feb 2023 • Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Human perception of the complex world relies on a comprehensive analysis of multi-modal signals, and the co-occurrences of audio and video signals provide humans with rich cues.
no code implementations • 11 Jan 2023 • Haibin Wu, Ke Tan, Buye Xu, Anurag Kumar, Daniel Wong
By comparing complex- and real-valued versions of fundamental building blocks in the recently developed gated convolutional recurrent network (GCRN), we show how different mechanisms for basic blocks affect the performance.
no code implementations • 20 Nov 2022 • Rodrigo Mira, Buye Xu, Jacob Donley, Anurag Kumar, Stavros Petridis, Vamsi Krishna Ithapu, Maja Pantic
Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements.
no code implementations • 16 Nov 2022 • Kuan-Lin Chen, Daniel D. E. Wong, Ke Tan, Buye Xu, Anurag Kumar, Vamsi Krishna Ithapu
During training, our approach augments a model learning complex spectral mapping with a temporary submodel to predict the covariance of the enhancement error at each time-frequency bin.
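One standard way a predicted per-bin error variance can enter a training objective is through a heteroscedastic Gaussian negative log-likelihood; the sketch below shows that generic form and is not necessarily the paper's exact objective.

```python
import numpy as np

def heteroscedastic_nll(error, log_var):
    """Gaussian negative log-likelihood per time-frequency bin: the
    squared enhancement error is scaled by the predicted (inverse)
    variance, and the log-variance term keeps the model from simply
    predicting unbounded uncertainty."""
    error = np.asarray(error, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(np.mean(np.exp(-log_var) * error ** 2 + log_var))
```

Bins where the submodel predicts high variance are down-weighted in the squared-error term, which is the usual motivation for this kind of uncertainty-aware loss.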
1 code implementation • 1 Jul 2022 • Muqiao Yang, Joseph Konan, David Bick, Anurag Kumar, Shinji Watanabe, Bhiksha Raj
We first identify key acoustic parameters that have been found to correlate well with voice quality (e.g., jitter, shimmer, and spectral flux), and then propose objective functions aimed at reducing the difference between clean speech and enhanced speech with respect to these features.
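To make one of the named parameters concrete, the sketch below computes spectral flux from a magnitude spectrogram and penalizes the difference between enhanced and clean speech on that feature. The flux definition (mean positive frame-to-frame change) and the absolute-difference objective are common conventions assumed here for illustration, not the paper's exact choices.

```python
import numpy as np

def spectral_flux(mag_spec):
    """Spectral flux of a (frames x bins) magnitude spectrogram:
    mean positive change in magnitude between consecutive frames."""
    mag_spec = np.asarray(mag_spec, dtype=float)
    diff = np.diff(mag_spec, axis=0)
    return float(np.mean(np.maximum(diff, 0.0)))

def flux_match_loss(enhanced_spec, clean_spec):
    """Absolute difference in spectral flux between enhanced and
    clean speech -- one candidate term of a perceptual objective."""
    return abs(spectral_flux(enhanced_spec) - spectral_flux(clean_spec))
```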
no code implementations • 24 Jun 2022 • Pranay Manocha, Anurag Kumar, Buye Xu, Anjali Menon, Israel D. Gebru, Vamsi K. Ithapu, Paul Calamia
Audio quality assessment is critical for assessing the perceptual realism of sounds.
1 code implementation • 24 Jun 2022 • Pranay Manocha, Anurag Kumar
Human judgments obtained through Mean Opinion Scores (MOS) are the most reliable way to assess the quality of speech signals.
no code implementations • 17 Feb 2022 • Anastasia Kuznetsova, Anurag Kumar, Jennifer Drexler Fox, Francis Tyers
Modern end-to-end speech recognition models show astonishing results in transcribing audio signals into written text.
1 code implementation • 17 Feb 2022 • Efthymios Tzinis, Yossi Adi, Vamsi Krishna Ithapu, Buye Xu, Paris Smaragdis, Anurag Kumar
RemixIT is based on a continuous self-training scheme in which a pre-trained teacher model on out-of-domain data infers estimated pseudo-target signals for in-domain mixtures.
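The teacher-student scheme can be sketched as a single training step: the teacher separates in-domain mixtures into pseudo-targets, the estimated noise sources are permuted and remixed with the estimated speech, and the student is updated on the resulting pairs. The interfaces (`teacher`, `student_update`) and the remixing detail are simplified assumptions for illustration.

```python
import numpy as np

def remixit_step(teacher, student_update, mixtures, rng):
    """One self-training step in the spirit of RemixIT (simplified):
    a pre-trained teacher infers pseudo-target signals for in-domain
    mixtures, the noise estimates are shuffled to bootstrap new
    mixtures, and the student trains on (remix, pseudo-target) pairs."""
    speech_est, noise_est = teacher(mixtures)      # pseudo-target signals
    perm = rng.permutation(len(noise_est))         # remix noise across items
    remixes = speech_est + noise_est[perm]
    return student_update(remixes, speech_est)
```

In the actual method the teacher can also be refreshed from the student over time (continuous self-training); this sketch shows only a single student update.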
no code implementations • 1 Feb 2022 • Zhiqi Kang, Mostafa Sadeghi, Radu Horaud, Xavier Alameda-Pineda, Jacob Donley, Anurag Kumar
This paper investigates the impact of head movements on audio-visual speech enhancement (AVSE).
1 code implementation • 19 Oct 2021 • Efthymios Tzinis, Yossi Adi, Vamsi K. Ithapu, Buye Xu, Anurag Kumar
Specifically, a separation teacher model is pre-trained on an out-of-domain dataset and is used to infer estimated target signals for a batch of in-domain mixtures.
no code implementations • 14 Oct 2021 • Sangeeta Srivastava, Yun Wang, Andros Tjandra, Anurag Kumar, Chunxi Liu, Kritika Singh, Yatharth Saraf
While self-supervised speech representation learning has been popular in the speech research community, very few works have comprehensively analyzed audio representation learning for non-speech audio tasks.
Ranked #5 on Audio Classification on Balanced Audio Set
3 code implementations • CVPR 2022 • Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei HUANG, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.
1 code implementation • NeurIPS 2021 • Pranay Manocha, Buye Xu, Anurag Kumar
We show that neural networks trained using our framework produce scores that correlate well with subjective mean opinion scores (MOS) and are also competitive with methods such as DNSMOS, which explicitly relies on MOS from humans for training networks.
no code implementations • 11 Sep 2021 • Yangyang Xia, Buye Xu, Anurag Kumar
Supervised speech enhancement relies on parallel databases of degraded speech signals and their clean reference signals during training.
no code implementations • 25 Jun 2021 • Ori Kabeli, Yossi Adi, Zhenyu Tang, Buye Xu, Anurag Kumar
Our stateful implementation for online separation leads to a minor drop in performance compared to the offline model: 0.8 dB for monaural inputs and 0.3 dB for binaural inputs, while reaching a real-time factor of 0.65.
no code implementations • 21 Jun 2021 • Anurag Kumar, Yun Wang, Vamsi Krishna Ithapu, Christian Fuegen
We also provide insights into the attributes of sound event representations that enable such efficient information transfer.
no code implementations • 29 May 2021 • Pranay Manocha, Anurag Kumar, Buye Xu, Anjali Menon, Israel D. Gebru, Vamsi K. Ithapu, Paul Calamia
Subjective evaluations are critical for assessing the perceptual realism of sounds in audio-synthesis driven technologies like augmented and virtual reality.
no code implementations • 24 May 2021 • Avinash Mohan, Arpan Chattopadhyay, Shivam Vinayak Vatsa, Anurag Kumar
Limiting the policy to this class reduces the problem to obtaining a queue switching policy at queue emptiness instants.
no code implementations • 6 Feb 2021 • Anastasia Kuznetsova, Anurag Kumar, Francis M. Tyers
The Automated Speech Recognition (ASR) task has been a challenging domain, especially for low-data scenarios with few audio examples.
Automatic Speech Recognition (ASR)
1 code implementation • 2 Sep 2020 • Ke Tan, Buye Xu, Anurag Kumar, Eliya Nachmani, Yossi Adi
In addition, our approach effectively preserves the interaural cues, which improves the accuracy of sound localization.
Audio and Speech Processing, Sound
no code implementations • ICML 2020 • Anurag Kumar, Vamsi Krishna Ithapu
An important problem in machine auditory perception is to recognize and detect sound events.
Ranked #31 on Audio Classification on AudioSet
no code implementations • 29 May 2020 • Haytham M. Fayek, Anurag Kumar
Recognizing sounds is a key aspect of computational audio scene analysis and machine perception.
Ranked #22 on Audio Classification on AudioSet
no code implementations • 25 Oct 2019 • Anurag Kumar, Vamsi Krishna Ithapu
Weakly supervised learning algorithms are critical for scaling audio event detection to several hundreds of sound categories.
1 code implementation • 28th International Joint Conference on Artificial Intelligence 2019 • Anurag Kumar, Ankit Shah, Alex Hauptmann, Bhiksha Raj
In the last couple of years, weakly labeled learning for sound events has turned out to be an exciting approach for audio event detection.
1 code implementation • 24 Apr 2018 • Ankit Shah, Anurag Kumar, Alexander G. Hauptmann, Bhiksha Raj
In this work, we first describe a CNN based approach for weakly supervised training of audio events.
no code implementations • NIPS Workshop on Machine Learning for Audio 2018 • Benjamin Elizalde, Rohan Badlani, Ankit Shah, Anurag Kumar, Bhiksha Raj
Sounds are essential to how humans perceive and interact with the world.
1 code implementation • 4 Nov 2017 • Anurag Kumar, Maksim Khadkevich, Christian Fugen
In this work we propose approaches to effectively transfer knowledge from weakly labeled web audio data.
Sound, Multimedia, Audio and Speech Processing
no code implementations • 2 Nov 2017 • Rohan Badlani, Ankit Shah, Benjamin Elizalde, Anurag Kumar, Bhiksha Raj
The framework crawls videos using search queries corresponding to 78 sound event labels drawn from three datasets.
no code implementations • 9 Jul 2017 • Anurag Kumar, Bhiksha Raj
We propose that learning algorithms that can exploit weak labels offer an effective method to learn from web data.
no code implementations • 12 Nov 2016 • Anurag Kumar, Bhiksha Raj
In this paper we propose a novel learning framework called Supervised and Weakly Supervised Learning where the goal is to learn simultaneously from weakly and strongly labeled data.
no code implementations • 23 Sep 2016 • Anurag Kumar, Bhiksha Raj, Ndapandula Nakashole
In this paper we describe approaches for discovering acoustic concepts and relations in text.
no code implementations • 20 Sep 2016 • Benjamin Elizalde, Ankit Shah, Siddharth Dalmia, Min Hun Lee, Rohan Badlani, Anurag Kumar, Bhiksha Raj, Ian Lane
The audio event detectors are trained on the labeled audio and run on the unlabeled audio downloaded from YouTube.
no code implementations • 19 Jul 2016 • Anurag Kumar, Bhiksha Raj
One of the most important problems in audio event detection research is the absence of benchmark results for comparison with any proposed method.
Sound, Multimedia
no code implementations • 9 Jul 2016 • Anurag Kumar, Bhiksha Raj
In this paper we propose strategies for estimating performance of a classifier when labels cannot be obtained for the whole test set.
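The simplest baseline for this setting is to label only a random subset of the test set and extrapolate; the sketch below shows that baseline (simple random sampling) as a stand-in for the more refined strategies the paper develops. The function and argument names are hypothetical.

```python
import numpy as np

def estimate_accuracy(predictions, label_budget, oracle, rng):
    """Estimate test-set accuracy when labels can be obtained for only
    `label_budget` points: sample a random subset, query the labeling
    oracle for those points, and report the subset accuracy as an
    unbiased estimate of the full test-set accuracy."""
    idx = rng.choice(len(predictions), size=label_budget, replace=False)
    return float(np.mean([predictions[i] == oracle(i) for i in idx]))
```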
no code implementations • 12 Jun 2016 • Anurag Kumar, Bhiksha Raj
Audio Event Detection is an important task for content analysis of multimedia data.
2 code implementations • 9 May 2016 • Anurag Kumar, Dinei Florencio
In this paper we consider the problem of speech enhancement in real-world like conditions where multiple noises can simultaneously corrupt speech.
Sound
no code implementations • 9 May 2016 • Anurag Kumar, Bhiksha Raj
This helps in obtaining a complete description of the recording and is notable because temporal information is absent from weakly labeled data in the first place.
no code implementations • 6 Feb 2015 • Anurag Kumar, Bhiksha Raj
We also introduce a novel metric for ranking instances based on an index which depends upon the rank of weighted scores of test points among the weighted scores of training points.
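The snippet does not spell out the index, so the sketch below shows one plausible reading as a hypothetical illustration: a test point's weighted score is mapped to the fraction of weighted training scores it exceeds, i.e. its normalized rank among the training points. This is not asserted to be the paper's exact metric.

```python
import numpy as np

def rank_index(test_score, train_scores):
    """Normalized rank of a test point's weighted score among the
    weighted scores of the training points: the fraction of training
    scores it strictly exceeds, giving a ranking index in [0, 1]."""
    train_scores = np.asarray(train_scores, dtype=float)
    return float(np.mean(train_scores < test_score))
```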