no code implementations • 12 May 2025 • Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha, Gunhee Kim, Jun Du, Rafael Valle, Bryan Catanzaro
We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding.
1 code implementation • 6 Mar 2025 • Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, Bryan Catanzaro
Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert annotated benchmark for evaluating ALMs on long audio understanding capabilities.
no code implementations • 2 Mar 2025 • Alexander H. Liu, Sang-gil Lee, Chao-Han Huck Yang, Yuan Gong, Yu-Chiang Frank Wang, James R. Glass, Rafael Valle, Bryan Catanzaro
We show that with the appropriate design choices for pre-training, one can jointly learn a representation encoder and generative audio decoder that can be applied to both types of tasks.
no code implementations • 7 Feb 2025 • Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Mikyas T. Desta, Roy Fejgin, Rafael Valle, Jason Li
While autoregressive speech token generation models produce speech with remarkable variety and naturalness, their inherent lack of controllability often results in issues such as hallucinations and undesired vocalizations that do not conform to conditioning inputs.
no code implementations • 20 Jan 2025 • Zhifeng Kong, Kevin J Shih, Weili Nie, Arash Vahdat, Sang-gil Lee, Joao Felipe Santos, Ante Jukic, Rafael Valle, Bryan Catanzaro
Audio in the real world may be perturbed due to numerous factors, causing the audio quality to be degraded.
1 code implementation • 30 Dec 2024 • Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Amir Ali Bagherzadeh, Chuan Li, Rafael Valle, Bryan Catanzaro, Soujanya Poria
We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU.
Ranked #2 on Audio Generation on AudioCaps
no code implementations • 26 Dec 2024 • Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, Bryan Catanzaro
Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts.
Ranked #1 on Audio Generation on AudioCaps
no code implementations • 15 Oct 2024 • Arushi Goel, Karan Sapra, Matthieu Le, Rafael Valle, Andrew Tao, Bryan Catanzaro
OCTAV (Omni Context and Temporal Audio Video) is a novel dataset designed to capture event transitions across audio and video.
Audio-visual Question Answering
Audio-Visual Question Answering (AVQA)
1 code implementation • 2 Oct 2024 • Sreyan Ghosh, Sonal Kumar, Zhifeng Kong, Rafael Valle, Bryan Catanzaro, Dinesh Manocha
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
no code implementations • 25 Jun 2024 • Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg
Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers.
1 code implementation • 18 Jun 2024 • Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan Catanzaro
It is an open challenge to obtain high quality training data, especially captions, for text-to-audio models.
Ranked #5 on Audio Generation on AudioCaps (CLAP_LAION metric)
no code implementations • 11 Apr 2024 • Arushi Goel, Zhifeng Kong, Rafael Valle, Bryan Catanzaro
Existing datasets for audio understanding primarily focus on single-turn interactions (i.e., audio captioning, audio question answering) for describing audio in natural language, thus limiting understanding audio via interactive dialogue.
1 code implementation • 2 Feb 2024 • Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, Bryan Catanzaro
Augmenting large language models (LLMs) to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs.
Ranked #1 on Retrieval-augmented Few-shot In-context Audio Captioning on AudioCaps (using extra training data)
no code implementations • 24 Jan 2024 • Akshit Arora, Rohan Badlani, Sungwon Kim, Rafael Valle, Bryan Catanzaro
In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets.
no code implementations • 14 Oct 2023 • Paarth Neekhara, Shehzeen Hussain, Rafael Valle, Boris Ginsburg, Rishabh Ranjan, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley
In this work, instead of explicitly disentangling attributes with loss terms, we present a framework to train a controllable voice conversion model on entangled speech representations derived from self-supervised learning (SSL) and speaker verification models.
1 code implementation • NeurIPS 2023 • Sungwon Kim, Kevin J. Shih, Rohan Badlani, Joao Felipe Santos, Evelina Bakhturina, Mikyas T. Desta, Rafael Valle, Sungroh Yoon, Bryan Catanzaro
P-Flow comprises a speech-prompted text encoder for speaker adaptation and a flow matching generative decoder for high-quality and fast speech synthesis.
no code implementations • 14 Mar 2023 • Rohan Badlani, Akshit Arora, Subhankar Ghosh, Rafael Valle, Kevin J. Shih, João Felipe Santos, Boris Ginsburg, Bryan Catanzaro
We introduce VANI, a very lightweight multi-lingual accent controllable speech synthesis system.
no code implementations • 24 Jan 2023 • Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro
We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice.
no code implementations • ICCV 2023 • Siddharth Gururani, Arun Mallya, Ting-Chun Wang, Rafael Valle, Ming-Yu Liu
It uses a multi-stage approach, combining the controllability of facial landmarks with the high-quality synthesis power of a pretrained face generator.
1 code implementation • 3 Mar 2022 • Kevin J. Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro
Despite recent advances in generative modeling for text-to-speech synthesis, these models do not yet have the same fine-grained adjustability of pitch-conditioned deterministic models such as FastPitch and FastSpeech2.
3 code implementations • 23 Aug 2021 • Rohan Badlani, Adrian Łancucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro
However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words.
1 code implementation • ICML Workshop INNF 2021 • Kevin J. Shih, Rafael Valle, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro
This work introduces a predominantly parallel, end-to-end TTS model based on normalizing flows.
3 code implementations • ICLR 2021 • Rafael Valle, Kevin Shih, Ryan Prenger, Bryan Catanzaro
In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.
Ranked #1 on Text-To-Speech Synthesis on LJSpeech (Pleasantness MOS metric, using extra training data)
no code implementations • 25 Dec 2019 • Rafael Valle, Fitsum Reda, Mohammad Shoeybi, Patrick Legresley, Andrew Tao, Bryan Catanzaro
We propose a novel approach for image segmentation that combines Neural Ordinary Differential Equations (NODEs) and the Level Set method.
4 code implementations • 26 Oct 2019 • Rafael Valle, Jason Li, Ryan Prenger, Bryan Catanzaro
Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data.
2 code implementations • 31 Oct 2018 • Ryan Prenger, Rafael Valle, Bryan Catanzaro
In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms.
Ranked #12 on Speech Synthesis on LibriTTS
no code implementations • ICLR 2019 • Rafael Valle, Wilson Cai, Anish Doshi
In this paper we show strategies to easily identify fake samples generated with the Generative Adversarial Network framework.
no code implementations • 8 Jan 2018 • Wilson Cai, Anish Doshi, Rafael Valle
In this paper we investigate the ability of generative adversarial networks (GANs) to synthesize spoofing attacks on modern speaker recognition systems.
1 code implementation • 11 Dec 2017 • Jason Poulos, Rafael Valle
When the sequence alignment is one-to-one, softmax attention is able to learn a more precise alignment at each step of the decoding, whereas the alignment generated by sigmoid attention is much less precise.
1 code implementation • 28 Oct 2016 • Jason Poulos, Rafael Valle
Missing data imputation can help improve the performance of prediction models in situations where missing data hide useful information.
Ranked #1 on Imputation on Adult