Search Results for author: Bryan Catanzaro

Found 126 papers, 62 papers with code

Llama-Nemotron: Efficient Reasoning Models

no code implementations 2 May 2025 Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Gerald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, Zijia Chen, Zhilin Wang, David Mosallanezhad, Adi Renduchintala, Haifeng Qian, Dima Rekesh, Fei Jia, Somshubra Majumdar, Vahid Noroozi, Wasi Uddin Ahmad, Sean Narenthiran, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Igor Gitman, Ivan Moshkov, Wei Du, Shubham Toshniwal, George Armstrong, Branislav Kisacanin, Matvei Novikov, Daria Gitman, Evelina Bakhturina, Jane Polak Scowcroft, John Kamalu, Dan Su, Kezhi Kong, Markus Kliegl, Rabeeh Karimi, Ying Lin, Sanjeev Satheesh, Jupinder Parmar, Pritam Gundecha, Brandon Norick, Joseph Jennings, Shrimai Prabhumoye, Syeda Nahida Akter, Mostofa Patwary, Abhinav Khattar, Deepak Narayanan, Roger Waleffe, Jimmy Zhang, Bor-Yiing Su, Guyue Huang, Terry Kong, Parth Chadha, Sahil Jain, Christine Harvey, Elad Segal, Jining Huang, Sergey Kashirsky, Robert McQueen, Izzy Putterman, George Lam, Arun Venkatesan, Sherry Wu, Vinh Nguyen, Manoj Kilaru, Andrew Wang, Anna Warno, Abhilash Somasamudramath, Sandip Bhaskar, Maka Dong, Nave Assaf, Shahar Mor, Omer Ullman Argov, Scot Junkin, Oleksandr Romanenko, Pedro Larroy, Marco Rovinelli, Viji Balas, Nicholas Edelman, Anahita Bhiwandiwalla, Muthu Subramaniam, Smita Ithape, Karthik Ramamoorthy, Yuting Wu, Suguna Varshini Velury, Omri Almog, Joyjit Daw, Denys Fridman, Erick Galinkin, Michael Evans, Shaona Ghosh, Katherine Luna, Leon Derczynski, Nikki Pope, Eileen Long, Seth Schneider, Guillermo Siman, Tomasz Grzegorzek, Pablo Ribalta, Monika Katariya, Chris Alexiuk, Joey Conway, Trisha Saar, Ann Guan, Krzysztof Pawelec, Shyamala Prayaga, Oleksii Kuchaiev, Boris Ginsburg, Oluwatobi Olabiyi, Kari Briski, Jonathan Cohen, Bryan Catanzaro, Jonah Alben, Yonatan Geifman, Eric Chung

We introduce the Llama-Nemotron series of models, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use.

Knowledge Distillation Neural Architecture Search

Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning

no code implementations 25 Apr 2025 Shaokun Zhang, Yi Dong, Jieyu Zhang, Jan Kautz, Bryan Catanzaro, Andrew Tao, Qingyun Wu, Zhiding Yu, Guilin Liu

In this work, we explore rule-based reinforcement learning to enhance tool-calling in LLMs, resulting in Nemotron-Research-Tool-N1, a series of tool-calling reasoning models.
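A minimal sketch of what a rule-based tool-calling reward could look like, assuming a binary check on format and tool-name validity; the tool set and reward rules here are hypothetical, not the paper's exact design:

```python
import json

# Hypothetical tool set; the paper's actual reward rules are not reproduced here.
KNOWN_TOOLS = {"search", "calculator"}

def tool_call_reward(completion: str) -> float:
    """Return 1.0 iff the completion is a well-formed call to a known tool."""
    try:
        call = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(call, dict):
        return 0.0
    if call.get("name") in KNOWN_TOOLS and isinstance(call.get("arguments"), dict):
        return 1.0
    return 0.0
```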

reinforcement-learning Reinforcement Learning

From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models

no code implementations 8 Apr 2025 Chejian Xu, Wei Ping, Peng Xu, Zihan Liu, Boxin Wang, Mohammad Shoeybi, Bo Li, Bryan Catanzaro

Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling, all of which require models to process and reason over long sequences of text and multimodal data.

In-Context Learning Instruction Following +1

Retro-Search: Exploring Untaken Paths for Deeper and Efficient Reasoning

no code implementations 6 Apr 2025 Ximing Lu, Seungju Han, David Acuna, Hyunwoo Kim, JaeHun Jung, Shrimai Prabhumoye, Niklas Muennighoff, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi

For weak-to-strong improvement, we retrospectively revise R1-671B's traces from the OpenThoughts dataset using R1-distill-32B as the Retro-Search-er, a model 20x smaller.

Math

Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

no code implementations 4 Apr 2025 Nvidia: Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Sunil Mahabaleshwarkar, Andrew Tao, Anna Shors, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Bobby Chen, Boris Ginsburg, Boxin Wang, Brandon Norick, Brian Butterfield, Bryan Catanzaro, Carlo del Mundo, chengyu dong, Christine Harvey, Christopher Parisien, Dan Su, Daniel Korzekwa, Danny Yin, Daria Gitman, David Mosallanezhad, Deepak Narayanan, Denys Fridman, Dima Rekesh, Ding Ma, Dmytro Pykhtar, Dong Ahn, Duncan Riach, Dusan Stosic, Eileen Long, Elad Segal, Ellie Evans, Eric Chung, Erick Galinkin, Evelina Bakhturina, Ewa Dobrowolska, Fei Jia, Fuxiao Liu, Gargi Prasad, Gerald Shen, Guilin Liu, Guo Chen, Haifeng Qian, Helen Ngo, Hongbin Liu, Hui Li, Igor Gitman, Ilia Karmanov, Ivan Moshkov, Izik Golan, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jarno Seppanen, Jason Lu, Jason Sewall, Jiaqi Zeng, Jiaxuan You, Jimmy Zhang, Jing Zhang, Jining Huang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jon Barker, Jonathan Cohen, Joseph Jennings, Jupinder Parmar, Karan Sapra, Kari Briski, Kateryna Chumachenko, Katherine Luna, Keshav Santhanam, Kezhi Kong, Kirthi Sivamani, Krzysztof Pawelec, Kumar Anik, Kunlun Li, Lawrence McAfee, Leon Derczynski, Lindsey Pavao, Luis Vega, Lukas Voegtle, Maciej Bala, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski, Markus Kliegl, Marta Stepniewska-Dziubinska, Matthieu Le, Matvei Novikov, Mehrzad Samadi, Michael Andersch, Michael Evans, Miguel Martinez, Mike Chrzanowski, Mike Ranzinger, Mikolaj Blaz, Misha Smelyanskiy, Mohamed Fawzy, Mohammad Shoeybi, Mostofa Patwary, Nayeon Lee, Nima Tajbakhsh, Ning Xu, Oleg Rybakov, Oleksii Kuchaiev, Olivier Delalleau, Osvald Nitski, Parth Chadha, Pasha Shamis, Paulius Micikevicius, Pavlo Molchanov, Peter Dykas, Philipp Fischer, Pierre-Yves Aquilanti, Piotr Bialecki, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Rabeeh Karimi, Rahul Kandu, Ran El-Yaniv, Raviraj Joshi, Roger Waleffe, Ruoxi Zhang, Sabrina Kavanaugh, Sahil Jain, Samuel Kriman, Sangkug Lym, Sanjeev Satheesh, Saurav Muralidharan, Sean Narenthiran, Selvaraj Anandaraj, Seonmyeong Bak, Sergey Kashirsky, Seungju Han, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere Sreenivas, Sharon Clay, Shelby Thomas, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shyamala Prayaga, Siddhartha Jain, Sirshak Das, Slawek Kierat, Somshubra Majumdar, Song Han, Soumye Singhal, Sriharsha Niverty, Stefania Alborghetti, Suseella Panguluri, Swetha Bhendigeri, Syeda Nahida Akter, Szymon Migacz, Tal Shiri, Terry Kong, Timo Roman, Tomer Ronen, Trisha Saar, Tugrul Konuk, Tuomas Rintamaki, Tyler Poon, Ushnish De, Vahid Noroozi, Varun Singh, Vijay Korthikanti, Vitaly Kurin, Wasi Uddin Ahmad, Wei Du, Wei Ping, Wenliang Dai, Wonmin Byeon, Xiaowei Ren, Yao Xu, Yejin Choi, Yian Zhang, Ying Lin, Yoshi Suhara, Zhiding Yu, Zhiqi Li, Zhiyu Li, Zhongbo Zhu, Zhuolin Yang, Zijia Chen

We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level.

Mamba

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

1 code implementation 6 Mar 2025 Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, Bryan Catanzaro

Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert-annotated benchmark for evaluating ALMs on long-audio understanding capabilities.

Audio captioning Language Modeling +3

UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

no code implementations 2 Mar 2025 Alexander H. Liu, Sang-gil Lee, Chao-Han Huck Yang, Yuan Gong, Yu-Chiang Frank Wang, James R. Glass, Rafael Valle, Bryan Catanzaro

We show that with the appropriate design choices for pre-training, one can jointly learn a representation encoder and generative audio decoder that can be applied to both types of tasks.

Decoder Representation Learning +6

FeatSharp: Your Vision Model Features, Sharper

1 code implementation 22 Feb 2025 Mike Ranzinger, Greg Heinrich, Pavlo Molchanov, Jan Kautz, Bryan Catanzaro, Andrew Tao

The feature maps of vision encoders are fundamental to myriad modern AI tasks, ranging from core perception algorithms (e.g., semantic segmentation, object detection, depth perception, etc.)

model object-detection +2

A2SB: Audio-to-Audio Schrodinger Bridges

no code implementations 20 Jan 2025 Zhifeng Kong, Kevin J Shih, Weili Nie, Arash Vahdat, Sang-gil Lee, Joao Felipe Santos, Ante Jukic, Rafael Valle, Bryan Catanzaro

Audio in the real world may be perturbed due to numerous factors, causing the audio quality to be degraded.

Bandwidth Extension

TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

1 code implementation 30 Dec 2024 Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Amir Ali Bagherzadeh, Chuan Li, Rafael Valle, Bryan Catanzaro, Soujanya Poria

We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU.
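For readers unfamiliar with flow matching, a generic conditional flow-matching training step with a linear noise-to-data path looks roughly like the sketch below; this is a textbook illustration, not TangoFlux's training code, and `model` is an assumed velocity-prediction network:

```python
import torch

def flow_matching_loss(model, x1):
    """Generic conditional flow-matching step with a linear noise-to-data
    path: the network learns the (constant) velocity x1 - x0 along the path.
    `model(x_t, t)` is an assumed velocity-prediction network."""
    x0 = torch.randn_like(x1)                              # noise endpoint
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))   # per-example time
    x_t = (1 - t) * x0 + t * x1                            # point on the path
    return ((model(x_t, t) - (x1 - x0)) ** 2).mean()
```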

Audio Generation

ETTA: Elucidating the Design Space of Text-to-Audio Models

no code implementations 26 Dec 2024 Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, Bryan Catanzaro

Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts.

AudioCaps Audio captioning +4

AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

no code implementations 19 Dec 2024 Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

In this paper, we introduce AceMath, a suite of frontier math models that excel in solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones.

Math

Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining

no code implementations 18 Dec 2024 Steven Feng, Shrimai Prabhumoye, Kezhi Kong, Dan Su, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

We propose to design blends using downsampled data at a smaller scale of 1T tokens, and then demonstrate effective scaling of our approach to a larger token horizon of 15T tokens and a larger model size of 25B parameters.
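As a purely hypothetical illustration of a two-phase blend (source names and ratios invented), the idea is to tune mixture weights cheaply at the 1T-token scale and then reuse the recipe at the 15T-token scale:

```python
# Hypothetical two-phase blend: mixture weights are tuned on a downsampled
# 1T-token run, then reused for the full 15T-token run. Source names and
# ratios are invented for illustration.
phase1_blend = {"web_crawl": 0.6, "code": 0.2, "math": 0.1, "books": 0.1}
phase2_blend = {"web_crawl": 0.3, "code": 0.3, "math": 0.2, "books": 0.2}

def tokens_per_source(blend, total_tokens):
    """Convert blend weights into a per-source token budget."""
    assert abs(sum(blend.values()) - 1.0) < 1e-9
    return {source: weight * total_tokens for source, weight in blend.items()}

print(tokens_per_source(phase2_blend, 15e12))  # budget at the 15T-token scale
```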

RADIO Amplified: Improved Baselines for Agglomerative Vision Foundation Models

1 code implementation 10 Dec 2024 Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, Pavlo Molchanov

Agglomerative models have recently emerged as a powerful approach to training vision foundation models, leveraging multi-teacher distillation from existing models such as CLIP, DINO, and SAM.

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

no code implementations 3 Dec 2024 Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

This unlocks state-of-the-art training over a long token horizon: an 8B parameter model trained for 15T tokens, of which 7.2T came from our dataset, is better than the Llama 3.1 8B model: +5 on MMLU, +3.1 on ARC-Challenge, and +0.5 on average across ten diverse tasks.

ARC MMLU

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

no code implementations 4 Nov 2024 Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping

Our empirical results show that the fine-tuned MLLM retriever is capable of understanding challenging queries, composed of both text and image, but it underperforms compared to a smaller CLIP retriever in cross-modal retrieval tasks due to the modality bias exhibited by MLLMs.

Cross-Modal Retrieval Information Retrieval +2

OMCAT: Omni Context Aware Transformer

no code implementations 15 Oct 2024 Arushi Goel, Karan Sapra, Matthieu Le, Rafael Valle, Andrew Tao, Bryan Catanzaro

OCTAV (Omni Context and Temporal Audio Video) is a novel dataset designed to capture event transitions across audio and video.

Audio-visual Question Answering Audio-Visual Question Answering (AVQA) +5

MIND: Math Informed syNthetic Dialogues for Pretraining LLMs

no code implementations 15 Oct 2024 Syeda Nahida Akter, Shrimai Prabhumoye, John Kamalu, Sanjeev Satheesh, Eric Nyberg, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

The utility of synthetic data to enhance pretraining data quality and hence to improve downstream task accuracy has been widely explored in recent large language models (LLMs).

GSM8K Math +2

Upcycling Large Language Models into Mixture of Experts

no code implementations 10 Oct 2024 Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, Bryan Catanzaro

Upcycling pre-trained dense language models into sparse mixture-of-experts (MoE) models is an efficient approach to increase the model capacity of already trained models.
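The core mechanic of upcycling can be sketched in a few lines: each expert is initialized as a copy of the trained dense FFN and a router is added on top; router initialization and the paper's weight-scaling details are omitted:

```python
import copy
import torch.nn as nn

def upcycle_ffn_to_moe(dense_ffn: nn.Module, hidden_size: int, num_experts: int):
    """Sketch of MoE upcycling: every expert starts as a copy of the trained
    dense FFN, and a learned router dispatches tokens among them."""
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
    router = nn.Linear(hidden_size, num_experts, bias=False)
    return experts, router
```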

Mixture-of-Experts MMLU

PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation

1 code implementation 2 Oct 2024 Mike Ranzinger, Jon Barker, Greg Heinrich, Pavlo Molchanov, Bryan Catanzaro, Andrew Tao

Various visual foundation models have distinct strengths and weaknesses, both of which can be improved through heterogeneous multi-teacher knowledge distillation without labels, termed "agglomerative models."

Knowledge Distillation

NVLM: Open Frontier-Class Multimodal LLMs

no code implementations 17 Sep 2024 Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2).

Math Multimodal Reasoning +1

Effective Large Language Model Debugging with Best-first Tree Search

no code implementations 26 Jul 2024 Jialin Song, Jonathan Raiman, Bryan Catanzaro

A fundamental difference between how an LLM writes code and how a human programmer does is that the LLM cannot consistently spot and fix bugs.
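Best-first tree search over candidate programs can be sketched generically as below; `score`, `propose_fixes`, and `passes_tests` are hypothetical stand-ins for an LLM value function, an LLM fix generator, and a test harness, not the paper's implementation:

```python
import heapq

def best_first_debug(initial_code, score, propose_fixes, passes_tests, budget=50):
    """Generic best-first tree search over candidate programs. Higher score
    means more promising, so negated scores go onto a min-heap."""
    frontier = [(-score(initial_code), initial_code)]
    for _ in range(budget):
        if not frontier:
            break
        _, code = heapq.heappop(frontier)   # expand the most promising node
        if passes_tests(code):
            return code
        for fix in propose_fixes(code):     # children = candidate bug fixes
            heapq.heappush(frontier, (-score(fix), fix))
    return None
```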

Code Generation Language Modeling +2

ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities

no code implementations 19 Jul 2024 Peng Xu, Wei Ping, Xianchao Wu, Chejian Xu, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro

In this work, we introduce ChatQA 2, a Llama 3.0-based model with a 128K context window, designed to bridge the gap between open-source LLMs and leading proprietary models (e.g., GPT-4-Turbo) in long-context understanding and retrieval-augmented generation (RAG) capabilities.

4k 8k +3

Compact Language Models via Pruning and Knowledge Distillation

1 code implementation 19 Jul 2024 Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov

Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive.

Knowledge Distillation Language Modeling +2

Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models

no code implementations 9 Jul 2024 Jupinder Parmar, Sanjeev Satheesh, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

As language models have scaled both their number of parameters and pretraining dataset sizes, the computational cost for pretraining has become intractable except for the most well-resourced teams.

Data, Data Everywhere: A Guide for Pretraining Dataset Construction

no code implementations 8 Jul 2024 Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Bo Liu, Aastha Jhunjhunwala, Zhilin Wang, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

The impressive capabilities of recent language models can be largely attributed to the multi-trillion token pretraining datasets that they are trained on.

Attribute

Improving Text-To-Audio Models with Synthetic Captions

1 code implementation 18 Jun 2024 Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan Catanzaro

It is an open challenge to obtain high-quality training data, especially captions, for text-to-audio models.

Ranked #5 on Audio Generation on AudioCaps (CLAP_LAION metric)

AudioCaps Audio captioning +4

Nemotron-4 340B Technical Report

1 code implementation 17 Jun 2024 Nvidia: Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, Robert Hero, Jining Huang, Vibhu Jawa, Joseph Jennings, Aastha Jhunjhunwala, John Kamalu, Sadaf Khan, Oleksii Kuchaiev, Patrick Legresley, Hui Li, Jiwei Liu, Zihan Liu, Eileen Long, Ameya Sunil Mahabaleshwarkar, Somshubra Majumdar, James Maki, Miguel Martinez, Maer Rodrigues de Melo, Ivan Moshkov, Deepak Narayanan, Sean Narenthiran, Jesus Navarro, Phong Nguyen, Osvald Nitski, Vahid Noroozi, Guruprasad Nutheti, Christopher Parisien, Jupinder Parmar, Mostofa Patwary, Krzysztof Pawelec, Wei Ping, Shrimai Prabhumoye, Rajarshi Roy, Trisha Saar, Vasanth Rao Naik Sabavat, Sanjeev Satheesh, Jane Polak Scowcroft, Jason Sewall, Pavel Shamis, Gerald Shen, Mohammad Shoeybi, Dave Sizer, Misha Smelyanskiy, Felipe Soares, Makesh Narsimhan Sreedhar, Dan Su, Sandeep Subramanian, Shengyang Sun, Shubham Toshniwal, Hao Wang, Zhilin Wang, Jiaxuan You, Jiaqi Zeng, Jimmy Zhang, Jing Zhang, Vivienne Zhang, Yian Zhang, Chen Zhu

We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward.

Synthetic Data Generation

CircuitVAE: Efficient and Scalable Latent Circuit Optimization

no code implementations 13 Jun 2024 Jialin Song, Aidan Swope, Robert Kirby, Rajarshi Roy, Saad Godil, Jonathan Raiman, Bryan Catanzaro

Automatically designing fast and space-efficient digital circuits is challenging because circuits are discrete, must exactly implement the desired logic, and are costly to simulate.

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

no code implementations 27 May 2024 Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

In this work, we introduce NV-Embed, incorporating architectural designs, training procedures, and curated datasets to significantly enhance the performance of an LLM as a versatile embedding model, while maintaining its simplicity and reproducibility.

Information Retrieval Language Modelling +6

Audio Dialogues: Dialogues dataset for audio and music understanding

no code implementations 11 Apr 2024 Arushi Goel, Zhifeng Kong, Rafael Valle, Bryan Catanzaro

Existing datasets for audio understanding primarily focus on single-turn interactions (i.e., audio captioning, audio question answering) for describing audio in natural language, thus limiting the understanding of audio via interactive dialogue.

Audio captioning Audio Question Answering +4

ODIN: Disentangled Reward Mitigates Hacking in RLHF

no code implementations 11 Feb 2024 Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro

In this work, we study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback (RLHF) on LLMs.

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

1 code implementation 2 Feb 2024 Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, Bryan Catanzaro

Augmenting large language models (LLMs) to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs.

Acoustic Scene Classification Few-Shot Learning +6

Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages

no code implementations 24 Jan 2024 Akshit Arora, Rohan Badlani, Sungwon Kim, Rafael Valle, Bryan Catanzaro

In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets.

Voice Cloning

ChatQA: Surpassing GPT-4 on Conversational QA and RAG

no code implementations 18 Jan 2024 Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro

In this work, we introduce ChatQA, a suite of models that outperform GPT-4 on retrieval-augmented generation (RAG) and conversational question answering (QA).

Ranked #6 on Question Answering on TriviaQA (using extra training data)

Conversational Question Answering RAG +1

InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining

1 code implementation 11 Oct 2023 Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, Bryan Catanzaro

After instruction tuning on Retro, InstructRetro demonstrates significant improvement over the instruction-tuned GPT on a wide range of zero-shot tasks.

4k Decoder +4

Retrieval meets Long Context Large Language Models

no code implementations 4 Oct 2023 Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, Bryan Catanzaro

Perhaps surprisingly, we find that an LLM with a 4K context window using simple retrieval-augmentation at generation can achieve performance comparable to a finetuned LLM with a 16K context window via positional interpolation on long-context tasks, while requiring much less computation.
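A minimal sketch of retrieval-augmentation at generation time, assuming hypothetical `embed` and `generate` model calls: rank document chunks by cosine similarity to the question and prepend the top-k so the prompt fits a small context window:

```python
import numpy as np

def retrieve_then_generate(question, chunks, embed, generate, k=4):
    """Rank chunks by cosine similarity to the question and stuff the top-k
    into the prompt. `embed` and `generate` are assumed model interfaces."""
    q = embed(question)
    embs = [embed(c) for c in chunks]
    scores = [float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
              for e in embs]
    top = [chunks[i] for i in np.argsort(scores)[::-1][:k]]
    prompt = "\n\n".join(top) + f"\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)
```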

16k 4k +4

CleanUNet 2: A Hybrid Speech Denoising Model on Waveform and Spectrogram

no code implementations 12 Sep 2023 Zhifeng Kong, Wei Ping, Ambrish Dantrey, Bryan Catanzaro

In this work, we present CleanUNet 2, a speech denoising model that combines the advantages of waveform and spectrogram denoisers and achieves the best of both worlds.

Denoising Speech Denoising +2

GraPhSyM: Graph Physical Synthesis Model

no code implementations 7 Aug 2023 Ahmed Agiza, Rajarshi Roy, Teodor Dumitru Ene, Saad Godil, Sherief Reda, Bryan Catanzaro

Given a gate-level netlist of a circuit represented as a graph, GraPhSyM utilizes graph structure, connectivity, and electrical property features to predict the impact of physical synthesis transformations such as buffer insertion and gate sizing.

Graph Attention model

Progressive Learning of 3D Reconstruction Network from 2D GAN Data

no code implementations 18 May 2023 Aysegul Dundar, Jun Gao, Andrew Tao, Bryan Catanzaro

In this work, to overcome these limitations of generated datasets, we have two main contributions which lead us to achieve state-of-the-art results on challenging objects: 1) a robust multi-stage learning scheme that gradually relies more on the model's own predictions when calculating losses, and 2) a novel adversarial learning pipeline with online pseudo-ground-truth generation to achieve fine details.

3D Reconstruction

Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

no code implementations ICCV 2023 Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji

Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy.

Image Generation Text-to-Video Generation +1

Multilingual Multiaccented Multispeaker TTS with RADTTS

no code implementations 24 Jan 2023 Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro

We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice.

Speech Synthesis

Evaluating Parameter Efficient Learning for Generation

no code implementations 25 Oct 2022 Peng Xu, Mostofa Patwary, Shrimai Prabhumoye, Virginia Adams, Ryan J. Prenger, Wei Ping, Nayeon Lee, Mohammad Shoeybi, Bryan Catanzaro

For cross-domain and cross-dataset cases, we show that (a) Adapter (Houlsby et al., 2019) performs the best amongst all the PERMs studied here, and (b) it outperforms finetuning if the task dataset is below a certain size.
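For reference, a Houlsby-style adapter (the best-performing PERM above) is just a bottleneck MLP with a residual connection inserted into an otherwise frozen model; a minimal PyTorch sketch:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter in the style of Houlsby et al. (2019): a small
    down/up projection with a residual connection; only these parameters
    are trained while the host model stays frozen."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual around bottleneck
```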

BigVGAN: A Universal Neural Vocoder with Large-Scale Training

5 code implementations 9 Jun 2022 Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon

Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates raw waveform conditioned on acoustic features, it is challenging to synthesize high-fidelity audio for numerous speakers across various recording environments.

Ranked #2 on Speech Synthesis on LibriTTS (using extra training data)

Audio Generation Audio Synthesis +4

Factuality Enhanced Language Models for Open-Ended Text Generation

5 code implementations 9 Jun 2022 Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale Fung, Mohammad Shoeybi, Bryan Catanzaro

In this work, we measure and improve the factual accuracy of large-scale LMs for open-ended text generation.

Misconceptions Sentence +2

PrefixRL: Optimization of Parallel Prefix Circuits using Deep Reinforcement Learning

no code implementations 14 May 2022 Rajarshi Roy, Jonathan Raiman, Neel Kant, Ilyas Elkin, Robert Kirby, Michael Siu, Stuart Oberman, Saad Godil, Bryan Catanzaro

Deep Convolutional RL agents trained on this environment produce prefix adder circuits that Pareto-dominate existing baselines with up to 16.0% and 30.2% lower area for the same delay in the 32b and 64b settings respectively.

Deep Reinforcement Learning reinforcement-learning +1

Reducing Activation Recomputation in Large Transformer Models

3 code implementations 10 May 2022 Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro

In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation.
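Activation recomputation itself is easy to illustrate with PyTorch's checkpoint utility: activations inside each wrapped block are dropped during the forward pass and recomputed during backward, trading extra compute for memory. The paper's selective recomputation is finer-grained than this sketch:

```python
from torch.utils.checkpoint import checkpoint

def forward_with_recompute(blocks, x):
    """Run a stack of transformer blocks with activation recomputation:
    each checkpointed block stores only its inputs, recomputing internal
    activations on the backward pass."""
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x
```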

Fine Detailed Texture Learning for 3D Meshes with Generative Models

no code implementations 17 Mar 2022 Aysegul Dundar, Jun Gao, Andrew Tao, Bryan Catanzaro

The reconstruction is posed as an adaptation problem and is done progressively where in the first stage, we focus on learning accurate geometry, whereas in the second stage, we focus on learning the texture with a generative adversarial network.

Generative Adversarial Network

Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows

1 code implementation 3 Mar 2022 Kevin J. Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro

Despite recent advances in generative modeling for text-to-speech synthesis, these models do not yet have the same fine-grained adjustability of pitch-conditioned deterministic models such as FastPitch and FastSpeech2.

Speech Synthesis text-to-speech +2

Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases

no code implementations 15 Dec 2021 Shrimai Prabhumoye, Rafal Kocielnik, Mohammad Shoeybi, Anima Anandkumar, Bryan Catanzaro

We then provide the LM with an instruction that consists of this subset of labeled exemplars, the query text to be classified, and a definition of bias, and prompt it to make a decision.
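The prompt assembly described above can be sketched directly; the exact formatting here is an assumption, not the paper's template:

```python
def build_bias_prompt(exemplars, definition, query):
    """Assemble the described few-shot instruction: labeled exemplars,
    a definition of bias, and the query text to classify."""
    shots = "\n".join(f"Text: {text}\nBiased: {label}" for text, label in exemplars)
    return (f"Definition of bias: {definition}\n\n{shots}\n\n"
            f"Text: {query}\nBiased:")

print(build_bias_prompt(
    [("Example sentence one.", "yes"), ("Example sentence two.", "no")],
    "Bias is an unfair prejudice toward a group.",
    "Sentence to classify.",
))
```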

Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers

1 code implementation 24 Nov 2021 John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, Bryan Catanzaro

AFNO is based on a principled foundation of operator learning which allows us to frame token mixing as a continuous global convolution without any dependence on the input resolution.
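A bare-bones sketch of FFT-based token mixing: a pointwise multiplication in the Fourier domain is a global (circular) convolution over tokens, with no dependence on sequence length. AFNO's actual block-diagonal per-mode MLP is simplified here to a single learned complex filter:

```python
import torch

def fourier_token_mixing(x, filter_weights):
    """Global token mixing via the FFT. x: (batch, tokens, channels);
    filter_weights: complex, broadcastable to (1, tokens // 2 + 1, 1)."""
    x_freq = torch.fft.rfft(x, dim=1)                 # tokens -> frequency modes
    x_freq = x_freq * filter_weights                  # per-mode mixing
    return torch.fft.irfft(x_freq, n=x.shape[1], dim=1)

x = torch.randn(2, 16, 8)
w = torch.randn(1, 9, 1, dtype=torch.cfloat)          # 16 tokens -> 9 rfft modes
print(fourier_token_mixing(x, w).shape)               # torch.Size([2, 16, 8])
```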

Computational Efficiency Operator learning +1

Efficient Token Mixing for Transformers via Adaptive Fourier Neural Operators

no code implementations ICLR 2022 John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, Bryan Catanzaro

AFNO is based on a principled foundation of operator learning which allows us to frame token mixing as a continuous global convolution without any dependence on the input resolution.

Computational Efficiency Operator learning +1

Guiding Global Placement With Reinforcement Learning

no code implementations 6 Sep 2021 Robert Kirby, Kolby Nottingham, Rajarshi Roy, Saad Godil, Bryan Catanzaro

In this work we augment state-of-the-art, force-based global placement solvers with a reinforcement learning agent trained to improve the final detail placed Half Perimeter Wire Length (HPWL).

reinforcement-learning Reinforcement Learning +1

One TTS Alignment To Rule Them All

3 code implementations 23 Aug 2021 Rohan Badlani, Adrian Łancucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro

However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words.

All Speech Synthesis +1

Long-Short Transformer: Efficient Transformers for Language and Vision

3 code implementations NeurIPS 2021 Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro

For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters of the previous method, while being faster and able to handle sequences 3x as long as its full-attention version on the same hardware.

Language Modeling Language Modelling

View Generalization for Single Image Textured 3D Models

no code implementations CVPR 2021 Anand Bhattad, Aysegul Dundar, Guilin Liu, Andrew Tao, Bryan Catanzaro

We describe a cycle consistency loss that encourages model textures to be aligned, so as to encourage sharing.

3D geometry

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

3 code implementations 9 Apr 2021 Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick Legresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia

In this paper, we show how different types of parallelism methods (tensor, pipeline, and data parallelism) can be composed to scale to thousands of GPUs and models with trillions of parameters.
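Tensor (intra-layer) parallelism, one of the three methods composed here, can be illustrated conceptually with NumPy: split a linear layer's weight column-wise across devices, compute local matmuls, then gather the shards:

```python
import numpy as np

# Conceptual sketch of tensor parallelism: one column block of the weight
# matrix per "device"; the concatenation stands in for an all-gather.
hidden, ffn, devices = 1024, 4096, 4
W = np.random.randn(hidden, ffn)
shards = np.split(W, devices, axis=1)      # one column block per GPU

x = np.random.randn(2, hidden)
partial = [x @ shard for shard in shards]  # each device's local computation
y = np.concatenate(partial, axis=1)        # gather the sharded outputs
assert np.allclose(y, x @ W)               # identical to the unsharded layer
```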

Language Modeling Language Modelling

Can Q-Learning with Graph Networks Learn a Generalizable Branching Heuristic for a SAT Solver?

1 code implementation NeurIPS 2020 Vitaly Kurin, Saad Godil, Shimon Whiteson, Bryan Catanzaro

While more work is needed to apply Graph-Q-SAT to reduce wall clock time in modern SAT solving settings, it is a compelling proof-of-concept showing that RL equipped with Graph Neural Networks can learn a generalizable branching heuristic for SAT search.

Feature Engineering Q-Learning +1

Neural FFTs for Universal Texture Image Synthesis

no code implementations NeurIPS 2020 Morteza Mardani, Guilin Liu, Aysegul Dundar, Shiqiu Liu, Andrew Tao, Bryan Catanzaro

The conventional CNNs, recently adopted for synthesis, require training and testing on the same set of images and fail to generalize to unseen images.

Image Generation Texture Synthesis

Local Knowledge Powered Conversational Agents

1 code implementation 20 Oct 2020 Sashank Santhanam, Wei Ping, Raul Puri, Mohammad Shoeybi, Mostofa Patwary, Bryan Catanzaro

State-of-the-art conversational agents have advanced significantly in conjunction with the use of large transformer-based language models.

Informativeness

DiffWave: A Versatile Diffusion Model for Audio Synthesis

11 code implementations ICLR 2021 Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, Bryan Catanzaro

In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation.
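A generic DDPM-style training step, of the kind DiffWave builds on for waveforms (noise schedule and conditioning are assumed inputs, and details are simplified):

```python
import torch

def diffusion_training_step(model, x0, alphas_cumprod):
    """Corrupt clean audio x0 with Gaussian noise at a random timestep t
    and train the network to predict that noise."""
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward process
    return ((model(x_t, t) - noise) ** 2).mean()           # noise-prediction loss
```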

Audio Synthesis Diversity +2

Transposer: Universal Texture Synthesis Using Feature Maps as Transposed Convolution Filter

no code implementations 14 Jul 2020 Guilin Liu, Rohan Taori, Ting-Chun Wang, Zhiding Yu, Shiqiu Liu, Fitsum A. Reda, Karan Sapra, Andrew Tao, Bryan Catanzaro

Specifically, we directly treat the whole encoded feature map of the input texture as transposed convolution filters and the features' self-similarity map, which captures the auto-correlation information, as input to the transposed convolution.
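That mechanism, reduced to its core, is a single transposed convolution whose filter is the encoded feature map rather than a learned weight; shapes below are illustrative, not the paper's:

```python
import torch
import torch.nn.functional as F

def transposer_step(self_similarity, feature_map):
    """Use the encoded feature map of the input texture directly as the
    transposed-convolution filter applied to the self-similarity map.
    self_similarity: (1, C_in, H, W); feature_map: (C_in, C_out, kH, kW)."""
    return F.conv_transpose2d(self_similarity, feature_map)

out = transposer_step(torch.randn(1, 8, 16, 16), torch.randn(8, 3, 4, 4))
print(out.shape)  # torch.Size([1, 3, 19, 19])
```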

Texture Synthesis

Hierarchical Multi-Scale Attention for Semantic Segmentation

8 code implementations 21 May 2020 Andrew Tao, Karan Sapra, Bryan Catanzaro

Multi-scale inference is commonly used to improve the results of semantic segmentation.

Ranked #6 on Semantic Segmentation on Cityscapes val (using extra training data)

Panoptic Segmentation

Large Scale Multi-Actor Generative Dialog Modeling

no code implementations ACL 2020 Alex Boyd, Raul Puri, Mohammad Shoeybi, Mostofa Patwary, Bryan Catanzaro

This work introduces the Generative Conversation Control model, an augmented and fine-tuned GPT-2 language model that conditions on past reference conversations to probabilistically model multi-turn conversations in the actor's persona.

Goal-Oriented Dialog Language Modelling

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

3 code implementations ICLR 2021 Rafael Valle, Kevin Shih, Ryan Prenger, Bryan Catanzaro

In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.

Ranked #1 on Text-To-Speech Synthesis on LJSpeech (Pleasantness MOS metric, using extra training data)

Speech Synthesis Style Transfer +3

Panoptic-based Image Synthesis

no code implementations CVPR 2020 Aysegul Dundar, Karan Sapra, Guilin Liu, Andrew Tao, Bryan Catanzaro

Conditional image synthesis for generating photorealistic images serves various applications, from content editing to content generation.

Image Generation

Genome Variant Calling with a Deep Averaging Network

no code implementations 13 Mar 2020 Nikolai Yakovenko, Avantika Lal, Johnny Israeli, Bryan Catanzaro

Variant calling, the problem of estimating whether a position in a DNA sequence differs from a reference sequence, given noisy, redundant, overlapping short sequences that cover that position, is fundamental to genomics.

Position

Training Question Answering Models From Synthetic Data

no code implementations EMNLP 2020 Raul Puri, Ryan Spring, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

On the SQuAD1.1 question answering task, we achieve higher accuracy using solely synthetic questions and answers than when using the SQuAD1.1 training set questions alone.

Answer Generation Data Augmentation +1

Neural ODEs for Image Segmentation with Level Sets

no code implementations 25 Dec 2019 Rafael Valle, Fitsum Reda, Mohammad Shoeybi, Patrick Legresley, Andrew Tao, Bryan Catanzaro

We propose a novel approach for image segmentation that combines Neural Ordinary Differential Equations (NODEs) and the Level Set method.

Deep Learning Image Segmentation +5

Zero-shot Text Classification With Generative Language Models

no code implementations 10 Dec 2019 Raul Puri, Bryan Catanzaro

This work investigates the use of natural language to enable zero-shot model adaptation to new tasks.

General Classification Language Modeling +4

Few-shot Video-to-Video Synthesis

6 code implementations NeurIPS 2019 Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, Bryan Catanzaro

To address the limitations, we propose a few-shot vid2vid framework, which learns to synthesize videos of previously unseen subjects or scenes by leveraging a few example images of the target at test time.

Video-to-Video Synthesis

Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens

4 code implementations 26 Oct 2019 Rafael Valle, Jason Li, Ryan Prenger, Bryan Catanzaro

Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data.

Rhythm Style Transfer

Can $Q$-Learning with Graph Networks Learn a Generalizable Branching Heuristic for a SAT Solver?

2 code implementations 26 Sep 2019 Vitaly Kurin, Saad Godil, Shimon Whiteson, Bryan Catanzaro

While more work is needed to apply Graph-$Q$-SAT to reduce wall clock time in modern SAT solving settings, it is a compelling proof-of-concept showing that RL equipped with Graph Neural Networks can learn a generalizable branching heuristic for SAT search.

Feature Engineering Q-Learning +2

Improving SAT Solver Heuristics with Graph Networks and Reinforcement Learning

no code implementations 25 Sep 2019 Vitaly Kurin, Saad Godil, Shimon Whiteson, Bryan Catanzaro

We present GQSAT, a branching heuristic in a Boolean SAT solver trained with value-based reinforcement learning (RL) using Graph Neural Networks for function approximation.

Feature Engineering reinforcement-learning +2

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

10 code implementations 17 Sep 2019 Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick Legresley, Jared Casper, Bryan Catanzaro

To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT.

LAMBADA Language Modeling +2

Video Interpolation and Prediction with Unsupervised Landmarks

no code implementations 6 Sep 2019 Kevin J. Shih, Aysegul Dundar, Animesh Garg, Robert Pottorf, Andrew Tao, Bryan Catanzaro

Prediction and interpolation for long-range video data involves the complex task of modeling motion trajectories for each visible object, occlusions and dis-occlusions, as well as appearance changes due to viewpoint and lighting.

Decoder Motion Interpolation +3

Unsupervised Video Interpolation Using Cycle Consistency

1 code implementation ICCV 2019 Fitsum A. Reda, Deqing Sun, Aysegul Dundar, Mohammad Shoeybi, Guilin Liu, Kevin J. Shih, Andrew Tao, Jan Kautz, Bryan Catanzaro

We further introduce a pseudo supervised loss term that enforces the interpolated frames to be consistent with predictions of a pre-trained interpolation model.
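The cycle-consistency idea can be sketched as follows, assuming a hypothetical midpoint-interpolation network `interp`; this is a simplified reading of the paper's scheme, in which a real frame supervises the model without any extra labels:

```python
import torch

def cycle_consistency_loss(interp, f0, f1, f2):
    """Interpolate midpoints of (f0, f1) and (f1, f2), then interpolate
    between those midpoints; the result should land back on the real
    frame f1, which serves as free supervision."""
    mid_a = interp(f0, f1)                 # estimate at t = 0.5
    mid_b = interp(f1, f2)                 # estimate at t = 1.5
    reconstructed = interp(mid_a, mid_b)   # lands back at t = 1.0
    return torch.mean(torch.abs(reconstructed - f1))
```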

Ranked #1 on Video Frame Interpolation on UCF101 (PSNR (sRGB) metric)

Triplet Video Frame Interpolation

Graphical Contrastive Losses for Scene Graph Parsing

3 code implementations CVPR 2019 Ji Zhang, Kevin J. Shih, Ahmed Elgammal, Andrew Tao, Bryan Catanzaro

The first, Entity Instance Confusion, occurs when the model confuses multiple instances of the same type of entity (e.g., multiple cups).

Relationship Detection Scene Graph Generation +1

Improving Semantic Segmentation via Video Propagation and Label Relaxation

5 code implementations CVPR 2019 Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao, Bryan Catanzaro

In this paper, we present a video prediction-based methodology to scale up training sets by synthesizing new training samples in order to improve the accuracy of semantic segmentation networks.

Ranked #2 on Semantic Segmentation on KITTI Semantic Segmentation (using extra training data)

Segmentation Semantic Segmentation +1

Practical Text Classification With Large Pre-Trained Language Models

1 code implementation 4 Dec 2018 Neel Kant, Raul Puri, Nikolai Yakovenko, Bryan Catanzaro

Multi-emotion sentiment classification is a natural language processing (NLP) problem with valuable use cases on real-world data.

Emotion Classification General Classification +5

Partial Convolution based Padding

4 code implementations 28 Nov 2018 Guilin Liu, Kevin J. Shih, Ting-Chun Wang, Fitsum A. Reda, Karan Sapra, Zhiding Yu, Andrew Tao, Bryan Catanzaro

In this paper, we present a simple yet effective padding scheme that can be used as a drop-in module for existing convolutional neural networks.

General Classification Semantic Segmentation

SDCNet: Video Prediction Using Spatially-Displaced Convolution

2 code implementations 2 Nov 2018 Fitsum A. Reda, Guilin Liu, Kevin J. Shih, Robert Kirby, Jon Barker, David Tarjan, Andrew Tao, Bryan Catanzaro

We present an approach for high-resolution video frame prediction by conditioning on both past frames and past optical flows.

Optical Flow Estimation Prediction +2

Introduction to the 1st Place Winning Model of OpenImages Relationship Detection Challenge

no code implementations 1 Nov 2018 Ji Zhang, Kevin Shih, Andrew Tao, Bryan Catanzaro, Ahmed Elgammal

This article describes the model we built that achieved 1st place in the OpenImage Visual Relationship Detection Challenge on Kaggle.

Relationship Detection Visual Relationship Detection

WaveGlow: A Flow-based Generative Network for Speech Synthesis

2 code implementations 31 Oct 2018 Ryan Prenger, Rafael Valle, Bryan Catanzaro

In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms.

Audio Synthesis regression +1

Video-to-Video Synthesis

10 code implementations NeurIPS 2018 Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro

We study the problem of video-to-video synthesis, whose goal is to learn a mapping function from an input source video (e.g., a sequence of semantic segmentation masks) to an output photorealistic video that precisely depicts the content of the source video.

2k Semantic Segmentation +3

Image Inpainting for Irregular Holes Using Partial Convolutions

60 code implementations ECCV 2018 Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, Bryan Catanzaro

Existing deep learning based image inpainting methods use a standard convolutional network over the corrupted image, using convolutional filter responses conditioned on both valid pixels as well as the substitute values in the masked holes (typically the mean value).
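The partial convolution itself can be sketched compactly: responses are computed over valid pixels only, renormalized by window coverage, and the mask shrinks as holes fill in (bias and the paper's multi-channel mask details are omitted; odd kernel size assumed):

```python
import torch
import torch.nn.functional as F

def partial_conv2d(x, mask, weight, eps=1e-8):
    """Minimal partial convolution. x: (B, C, H, W); mask: (B, 1, H, W)
    with 1 = valid pixel; weight: (Cout, C, k, k) with odd k."""
    k = weight.shape[-1]
    out = F.conv2d(x * mask, weight, padding=k // 2)
    coverage = F.conv2d(mask, torch.ones(1, 1, k, k), padding=k // 2)
    out = out * (k * k) / (coverage + eps)  # renormalize by valid window area
    new_mask = (coverage > 0).float()       # mask update rule
    return out * new_mask, new_mask
```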

Image Inpainting valid

High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs

21 code implementations CVPR 2018 Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, Bryan Catanzaro

We present a new method for synthesizing high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks (conditional GANs).

Conditional Image Generation Fundus to Angiography Generation +5

Malware Detection by Eating a Whole EXE

7 code implementations 25 Oct 2017 Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, Charles Nicholas

In this work we introduce malware detection from raw byte sequences as a fruitful research area to the larger machine learning community.

Malware Detection

DSD: Dense-Sparse-Dense Training for Deep Neural Networks

2 code implementations 15 Jul 2016 Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally

We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance.
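The sparse phase of DSD reduces to magnitude pruning plus masked retraining; a simplified sketch (per-tensor pruning, details from the paper omitted):

```python
import torch

def magnitude_prune(model, sparsity=0.5):
    """Keep the largest-magnitude weights, zero the rest, and retrain under
    the mask; the final dense phase restores and retrains the pruned
    connections."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:
            continue                                  # skip biases / norms
        k = max(1, int(sparsity * p.numel()))
        threshold = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > threshold).float()
        p.data.mul_(masks[name])                      # apply the sparse mask
    return masks  # re-apply after each optimizer step to stay sparse
```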

8k Caption Generation +3

cuDNN: Efficient Primitives for Deep Learning

3 code implementations 3 Oct 2014 Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer

To address this problem, we have created a library similar in intent to BLAS, with optimized routines for deep learning workloads.

Deep Learning

PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation

2 code implementations 18 Nov 2009 Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul Ivanov, Ahmed Fasih

In introducing PyCUDA and PyOpenCL, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems.
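A minimal working PyCUDA example of the run-time code generation the paper describes: the CUDA kernel source is an ordinary Python string, compiled on the fly (requires a CUDA-capable GPU and the pycuda package):

```python
import numpy as np
import pycuda.autoinit                    # noqa: F401 -- creates a CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# The kernel is generated and compiled at run time from a Python string.
mod = SourceModule("""
__global__ void scale(float *a, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] *= s;
}
""")
scale = mod.get_function("scale")

a = np.arange(256, dtype=np.float32)
scale(drv.InOut(a), np.float32(2.0), block=(256, 1, 1), grid=(1, 1))
print(a[:4])  # [0. 2. 4. 6.]
```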

Distributed, Parallel, and Cluster Computing Software Engineering D.1.2
