Pitch and time modification of an audio signal are fundamental audio editing operations, with applications in speech manipulation, audio-visual synchronization, and singing voice editing and synthesis.
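Both operations are exposed by common audio libraries; as a minimal sketch, the snippet below uses librosa, where the file paths, shift amount, and stretch rate are placeholder choices rather than values from any system discussed here:

```python
# Sketch: pitch-shifting and time-stretching with librosa.
# "input.wav" is a placeholder path; n_steps and rate are arbitrary examples.
import librosa
import soundfile as sf

y, sr = librosa.load("input.wav", sr=None)

# Raise the pitch by two semitones without changing duration.
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# Slow playback to 80% speed without changing pitch.
y_stretched = librosa.effects.time_stretch(y, rate=0.8)

sf.write("shifted.wav", y_shifted, sr)
sf.write("stretched.wav", y_stretched, sr)
```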
Recent advances in deep learning have expanded the possibilities for music generation, but generating a customizable, full-length piece of music with a consistent long-term structure remains a challenge.
Text-based speech editors expedite the editing of speech recordings by supporting intuitive cut, copy, and paste operations on a speech transcript.
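A minimal sketch of the idea, assuming a forced alignment that maps each transcript word to a time span: transcript edits become list operations on (word, start, end) tuples, and the edited waveform is re-rendered by concatenation. All names and timings below are illustrative.

```python
# Sketch: transcript-driven cut/copy/paste over word-aligned audio.
# Assumes a forced alignment is available; all names are illustrative.
import numpy as np

def render(alignment, audio, sr):
    """Concatenate the audio spans of the (word, start, end) alignment."""
    spans = [audio[int(s * sr):int(e * sr)] for _, s, e in alignment]
    return np.concatenate(spans) if spans else np.zeros(0, dtype=audio.dtype)

def delete_words(alignment, i, j):
    """'Cut': remove words i..j-1 from the transcript."""
    return alignment[:i] + alignment[j:]

def paste_words(alignment, i, words):
    """'Paste': insert previously copied (word, start, end) tuples at i."""
    return alignment[:i] + words + alignment[i:]

# Example: drop the second word, then re-render the edited recording.
alignment = [("hello", 0.00, 0.42), ("uh", 0.42, 0.61), ("world", 0.61, 1.10)]
# edited = render(delete_words(alignment, 1, 2), audio, sr)
```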
The DPAM approach of Manocha et al. learns a full-reference metric directly from human judgments, and thus correlates well with human perception.
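Schematically, a learned full-reference metric embeds both the reference and the degraded signal with a shared network and measures distance between activations, with the network fit to the human judgments. The toy sketch below shows only the shape of that computation with an arbitrary untrained encoder; it is not the actual DPAM architecture or training setup.

```python
# Toy sketch of a learned full-reference audio metric (not the real DPAM).
# A shared encoder embeds both signals; the metric is a distance between
# per-layer activations, and the encoder would be fit to human judgments.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(1 if i == 0 else 16, 16, 15, stride=2, padding=7),
                nn.ReLU())
            for i in range(4)
        ])

    def forward(self, x):          # x: (batch, 1, samples)
        acts = []
        for layer in self.layers:
            x = layer(x)
            acts.append(x)
        return acts

def perceptual_distance(enc, reference, degraded):
    """Mean L1 distance between activations at every layer."""
    d = 0.0
    for a, b in zip(enc(reference), enc(degraded)):
        d = d + (a - b).abs().mean()
    return d

enc = Encoder()
ref, deg = torch.randn(1, 1, 16000), torch.randn(1, 1, 16000)
print(perceptual_distance(enc, ref, deg))  # scalar distance
```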
For this task, it is typically necessary to define a similarity metric to compare one recording to another.
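For instance, a simple hand-crafted baseline is cosine similarity between time-averaged spectral features; the sketch below uses librosa MFCC statistics, with placeholder file paths and an arbitrary feature choice:

```python
# Sketch: a simple hand-crafted similarity metric between two recordings.
# File paths are placeholders; MFCC statistics are one common choice.
import librosa
import numpy as np

def embed(path):
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # (20, frames)
    return mfcc.mean(axis=1)                            # time-averaged

def similarity(path_a, path_b):
    a, b = embed(path_a), embed(path_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# print(similarity("query.wav", "candidate.wav"))
```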
To this end, we (1) outline past work on the relationship between metric learning and classification, (2) extend this relationship to multi-label data by exploring three different learning approaches and their disentangled versions, and (3) evaluate all models along four axes: training time, similarity retrieval, auto-tagging, and triplet prediction.
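As a concrete instance of the metric-learning side, a triplet objective trains an embedding so that same-label examples sit closer together than different-label ones; triplet prediction then just compares distances under that embedding. A minimal PyTorch sketch with illustrative dimensions:

```python
# Minimal sketch of triplet-based metric learning (illustrative dimensions).
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
loss_fn = nn.TripletMarginLoss(margin=0.2)

anchor, positive, negative = (torch.randn(8, 128) for _ in range(3))
za, zp, zn = embed(anchor), embed(positive), embed(negative)
loss = loss_fn(za, zp, zn)   # encourages d(za, zp) + margin < d(za, zn)
loss.backward()

# "Triplet prediction" then asks whether the anchor is closer to the
# positive than to the negative under the learned embedding:
correct = (za - zp).norm(dim=1) < (za - zn).norm(dim=1)
print(correct.float().mean())
```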
Speech synthesis has recently seen significant improvements in fidelity, driven by the advent of neural vocoders and neural prosody generators.
Real-world audio recordings are often degraded by factors such as noise, reverberation, and equalization distortion.
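Each of these degradations is simple to simulate, which is one common way to synthesize matched clean/degraded training pairs; a sketch with numpy/scipy, where the SNR, impulse response, and filter band are arbitrary illustrative choices:

```python
# Sketch: simulating noise, reverberation, and equalization distortion.
# The noise level, impulse response, and filter band are arbitrary examples.
import numpy as np
from scipy.signal import butter, fftconvolve, lfilter

def add_noise(clean, snr_db):
    noise = np.random.randn(len(clean))
    scale = np.linalg.norm(clean) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
    return clean + scale * noise

def add_reverb(clean, impulse_response):
    wet = fftconvolve(clean, impulse_response)[: len(clean)]
    return wet / (np.abs(wet).max() + 1e-8)

def band_limit(clean, sr, lo=300.0, hi=3400.0):
    """Crude 'equalization distortion': keep only a telephone-like band."""
    b, a = butter(4, [lo / (sr / 2), hi / (sr / 2)], btype="band")
    return lfilter(b, a, clean)

sr = 16000
clean = np.random.randn(sr)                     # stand-in for real speech
ir = np.exp(-np.linspace(0, 8, sr // 2)) * np.random.randn(sr // 2)
degraded = band_limit(add_reverb(add_noise(clean, snr_db=10), ir), sr)
```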
Recently, AutoVC, a method based on conditional autoencoders (CAEs), achieved state-of-the-art results by disentangling speaker identity from speech content using information-constraining bottlenecks; it performs zero-shot conversion by swapping in a different speaker's identity embedding to synthesize a new voice.
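Schematically, the decoder receives a deliberately narrow content code together with a speaker embedding, so conversion reduces to swapping that embedding. The sketch below is a toy simplification of this pattern, not the actual AutoVC network:

```python
# Schematic of CAE-style voice conversion (a simplification, not AutoVC).
# The bottleneck is kept narrow so the content code cannot carry speaker
# identity; the decoder re-injects identity via the speaker embedding.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, bottleneck=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, 64), nn.ReLU(),
                                 nn.Linear(64, bottleneck))  # narrow code

    def forward(self, mels):          # mels: (batch, frames, n_mels)
        return self.net(mels)

class Decoder(nn.Module):
    def __init__(self, n_mels=80, bottleneck=8, spk_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(bottleneck + spk_dim, 128),
                                 nn.ReLU(), nn.Linear(128, n_mels))

    def forward(self, code, spk):     # spk: (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, code.size(1), -1)
        return self.net(torch.cat([code, spk], dim=-1))

enc, dec = ContentEncoder(), Decoder()
src_mels = torch.randn(1, 100, 80)          # source utterance
tgt_spk = torch.randn(1, 256)               # target speaker embedding
converted = dec(enc(src_mels), tgt_spk)     # zero-shot conversion
```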
Assessment of many audio processing tasks relies on subjective evaluation, which is time-consuming and expensive.
To edit a video, the user only has to edit the transcript; an optimization strategy then chooses segments of the input corpus as base material.
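One simple way to picture the selection step is as a dynamic program over the edited transcript: each word is assigned one candidate segment from the corpus, trading a per-word match cost against a join cost between consecutive segments. The sketch below is a toy version of that idea, not the paper's actual optimizer; match_cost and join_cost are user-supplied placeholders.

```python
# Toy dynamic program for choosing one corpus segment per transcript word.
# A simplification for illustration, not the paper's actual optimizer.

def select_segments(words, candidates, match_cost, join_cost):
    """candidates[w] is a list of corpus segments that realize word w."""
    # prev[k] = (best cost ending at candidate k, backpointer)
    prev = [(match_cost(words[0], c), None) for c in candidates[words[0]]]
    back = [prev]
    for i in range(1, len(words)):
        cur = []
        for c in candidates[words[i]]:
            options = [(prev[k][0] + join_cost(p, c), k)
                       for k, p in enumerate(candidates[words[i - 1]])]
            best_cost, best_k = min(options)
            cur.append((best_cost + match_cost(words[i], c), best_k))
        back.append(cur)
        prev = cur
    # Trace back the cheapest path of segments.
    k = min(range(len(prev)), key=lambda j: prev[j][0])
    path = []
    for i in range(len(words) - 1, -1, -1):
        path.append(candidates[words[i]][k])
        k = back[i][k][1]
    return list(reversed(path))
```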