Search Results for author: Rafael Valle

Found 30 papers, 14 papers with code

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

1 code implementation 6 Mar 2025 Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, Bryan Catanzaro

Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert annotated benchmark for evaluating ALMs on long audio understanding capabilities.

Audio captioning Language Modeling +3

UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

no code implementations 2 Mar 2025 Alexander H. Liu, Sang-gil Lee, Chao-Han Huck Yang, Yuan Gong, Yu-Chiang Frank Wang, James R. Glass, Rafael Valle, Bryan Catanzaro

We show that with the appropriate design choices for pre-training, one can jointly learn a representation encoder and generative audio decoder that can be applied to both types of tasks.

Decoder Representation Learning +6

Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance

no code implementations 7 Feb 2025 Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Mikyas T. Desta, Roy Fejgin, Rafael Valle, Jason Li

While autoregressive speech token generation models produce speech with remarkable variety and naturalness, their inherent lack of controllability often results in issues such as hallucinations and undesired vocalizations that do not conform to conditioning inputs.

Automatic Speech Recognition Decoder +3

A2SB: Audio-to-Audio Schrodinger Bridges

no code implementations 20 Jan 2025 Zhifeng Kong, Kevin J Shih, Weili Nie, Arash Vahdat, Sang-gil Lee, Joao Felipe Santos, Ante Jukic, Rafael Valle, Bryan Catanzaro

Audio in the real world may be perturbed due to numerous factors, causing the audio quality to be degraded.

Bandwidth Extension

TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

1 code implementation 30 Dec 2024 Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Amir Ali Bagherzadeh, Chuan Li, Rafael Valle, Bryan Catanzaro, Soujanya Poria

We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1 kHz audio in just 3.7 seconds on a single A40 GPU.

Audio Generation

ETTA: Elucidating the Design Space of Text-to-Audio Models

no code implementations 26 Dec 2024 Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, Bryan Catanzaro

Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts.

AudioCaps Audio captioning +4

OMCAT: Omni Context Aware Transformer

no code implementations 15 Oct 2024 Arushi Goel, Karan Sapra, Matthieu Le, Rafael Valle, Andrew Tao, Bryan Catanzaro

OCTAV (Omni Context and Temporal Audio Video) is a novel dataset designed to capture event transitions across audio and video.

Audio-visual Question Answering Audio-Visual Question Answering (AVQA) +5

Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

no code implementations 25 Jun 2024 Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg

Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers.

Decoder Language Modeling +5

Improving Text-To-Audio Models with Synthetic Captions

1 code implementation 18 Jun 2024 Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan Catanzaro

It is an open challenge to obtain high quality training data, especially captions, for text-to-audio models.

Ranked #5 on Audio Generation on AudioCaps (CLAP_LAION metric)

AudioCaps Audio captioning +4

Audio Dialogues: Dialogues dataset for audio and music understanding

no code implementations 11 Apr 2024 Arushi Goel, Zhifeng Kong, Rafael Valle, Bryan Catanzaro

Existing datasets for audio understanding primarily focus on single-turn interactions (i.e., audio captioning, audio question answering) for describing audio in natural language, thus limiting understanding audio via interactive dialogue.

Audio captioning Audio Question Answering +4

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

1 code implementation 2 Feb 2024 Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, Bryan Catanzaro

Augmenting large language models (LLMs) to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs.

Acoustic Scene Classification Few-Shot Learning +6

Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages

no code implementations 24 Jan 2024 Akshit Arora, Rohan Badlani, Sungwon Kim, Rafael Valle, Bryan Catanzaro

In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets.

Voice Cloning

SelfVC: Voice Conversion With Iterative Refinement using Self Transformations

no code implementations 14 Oct 2023 Paarth Neekhara, Shehzeen Hussain, Rafael Valle, Boris Ginsburg, Rishabh Ranjan, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley

In this work, instead of explicitly disentangling attributes with loss terms, we present a framework to train a controllable voice conversion model on entangled speech representations derived from self-supervised learning (SSL) and speaker verification models.

Self-Supervised Learning Speaker Verification +2

Multilingual Multiaccented Multispeaker TTS with RADTTS

no code implementations 24 Jan 2023 Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro

We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice.

Speech Synthesis

SPACE: Speech-driven Portrait Animation with Controllable Expression

no code implementations ICCV 2023 Siddharth Gururani, Arun Mallya, Ting-Chun Wang, Rafael Valle, Ming-Yu Liu

It uses a multi-stage approach, combining the controllability of facial landmarks with the high-quality synthesis power of a pretrained face generator.

Portrait Animation

Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows

1 code implementation 3 Mar 2022 Kevin J. Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro

Despite recent advances in generative modeling for text-to-speech synthesis, these models do not yet have the same fine-grained adjustability of pitch-conditioned deterministic models such as FastPitch and FastSpeech2.

Speech Synthesis text-to-speech +2

One TTS Alignment To Rule Them All

3 code implementations 23 Aug 2021 Rohan Badlani, Adrian Łancucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro

However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words.

All Speech Synthesis +1

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

3 code implementations ICLR 2021 Rafael Valle, Kevin Shih, Ryan Prenger, Bryan Catanzaro

In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.

Ranked #1 on Text-To-Speech Synthesis on LJSpeech (Pleasantness MOS metric, using extra training data)

Speech Synthesis Style Transfer +3

Neural ODEs for Image Segmentation with Level Sets

no code implementations 25 Dec 2019 Rafael Valle, Fitsum Reda, Mohammad Shoeybi, Patrick Legresley, Andrew Tao, Bryan Catanzaro

We propose a novel approach for image segmentation that combines Neural Ordinary Differential Equations (NODEs) and the Level Set method.

Deep Learning Image Segmentation +5

Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens

4 code implementations 26 Oct 2019 Rafael Valle, Jason Li, Ryan Prenger, Bryan Catanzaro

Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data.

Rhythm Style Transfer

WaveGlow: A Flow-based Generative Network for Speech Synthesis

2 code implementations 31 Oct 2018 Ryan Prenger, Rafael Valle, Bryan Catanzaro

In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms.

Audio Synthesis regression +1

TequilaGAN: How to easily identify GAN samples

no code implementations ICLR 2019 Rafael Valle, Wilson Cai, Anish Doshi

In this paper we show strategies to easily identify fake samples generated with the Generative Adversarial Network framework.

Generative Adversarial Network

Attacking Speaker Recognition With Deep Generative Models

no code implementations 8 Jan 2018 Wilson Cai, Anish Doshi, Rafael Valle

In this paper we investigate the ability of generative adversarial networks (GANs) to synthesize spoofing attacks on modern speaker recognition systems.

Speaker Recognition

Character-Based Handwritten Text Transcription with Attention Networks

1 code implementation 11 Dec 2017 Jason Poulos, Rafael Valle

When the sequence alignment is one-to-one, softmax attention is able to learn a more precise alignment at each step of the decoding, whereas the alignment generated by sigmoid attention is much less precise.

Decoder Handwritten Text Recognition +1

Missing Data Imputation for Supervised Learning

1 code implementation 28 Oct 2016 Jason Poulos, Rafael Valle

Missing data imputation can help improve the performance of prediction models in situations where missing data hide useful information.

General Classification Imputation
