Search Results for author: William Chan

Found 37 papers, 13 papers with code

Video Diffusion Models

no code implementations 7 Apr 2022 Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, David J. Fleet

Generating temporally coherent high fidelity video is an important milestone in generative modeling research.

Video Generation Video Prediction

Learning Fast Samplers for Diffusion Models by Differentiating Through Sample Quality

no code implementations 11 Feb 2022 Daniel Watson, William Chan, Jonathan Ho, Mohammad Norouzi

We introduce Differentiable Diffusion Sampler Search (DDSS): a method that optimizes fast samplers for any pre-trained diffusion model by differentiating through sample quality scores.

Image Generation Unconditional Image Generation

Palette: Image-to-Image Diffusion Models

2 code implementations 10 Nov 2021 Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, Mohammad Norouzi

We expect this standardized evaluation protocol to play a role in advancing image-to-image translation research.

Colorization Denoising +5

Optimizing Few-Step Diffusion Samplers by Gradient Descent

no code implementations ICLR 2022 Daniel Watson, William Chan, Jonathan Ho, Mohammad Norouzi

We propose Generalized Gaussian Diffusion Processes (GGDP), a family of non-Markovian samplers for diffusion models, and we show how to improve the generated samples of pre-trained DDPMs by optimizing the degrees of freedom of the GGDP sampler family with respect to a perceptual loss.

Denoising Image Generation +1

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

2 code implementations 17 Jun 2021 Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, William Chan

The model takes an input phoneme sequence, and through an iterative refinement process, generates an audio waveform.

Speech Synthesis Text-To-Speech Synthesis

Learning to Efficiently Sample from Diffusion Probabilistic Models

no code implementations 7 Jun 2021 Daniel Watson, Jonathan Ho, Mohammad Norouzi, William Chan

Key advantages of DDPMs include ease of training, in contrast to generative adversarial networks, and speed of generation, in contrast to autoregressive models.

Denoising Speech Synthesis

Cascaded Diffusion Models for High Fidelity Image Generation

no code implementations 30 May 2021 Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, Tim Salimans

We show that cascaded diffusion models are capable of generating high fidelity images on the class-conditional ImageNet generation benchmark, without any assistance from auxiliary image classifiers to boost sample quality.

Data Augmentation Image Generation +1

Pushing the Limits of Non-Autoregressive Speech Recognition

no code implementations 7 Apr 2021 Edwin G. Ng, Chung-Cheng Chiu, Yu Zhang, William Chan

We apply recent advancements in end-to-end speech recognition to non-autoregressive automatic speech recognition.

Automatic Speech Recognition

SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network

no code implementations 5 Apr 2021 William Chan, Daniel Park, Chris Lee, Yu Zhang, Quoc Le, Mohammad Norouzi

We present SpeechStew, a speech recognition model that is trained on a combination of various publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal.

Speech Recognition Transfer Learning

WaveGrad: Estimating Gradients for Waveform Generation

6 code implementations ICLR 2021 Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, William Chan

This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density.

Speech Synthesis Text-To-Speech Synthesis

Non-Autoregressive Machine Translation with Latent Alignments

2 code implementations EMNLP 2020 Chitwan Saharia, William Chan, Saurabh Saxena, Mohammad Norouzi

In addition, we adapt the Imputer model for non-autoregressive machine translation and demonstrate that Imputer with just 4 generation steps can match the performance of an autoregressive Transformer baseline.

Machine Translation Translation

Insertion-Deletion Transformer

no code implementations 15 Jan 2020 Laura Ruis, Mitchell Stern, Julia Proskurnia, William Chan

We propose the Insertion-Deletion Transformer, a novel transformer-based neural architecture and training method for sequence generation.


SpecAugment on Large Scale Datasets

no code implementations 11 Dec 2019 Daniel S. Park, Yu Zhang, Chung-Cheng Chiu, Youzheng Chen, Bo Li, William Chan, Quoc V. Le, Yonghui Wu

Recently, SpecAugment, an augmentation scheme for automatic speech recognition that acts directly on the spectrogram of input utterances, has shown to be highly effective in enhancing the performance of end-to-end networks on public datasets.
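The scheme described above (augmentation applied directly to the spectrogram) can be illustrated with a minimal NumPy sketch. The mask widths and the toy spectrogram shape here are illustrative choices, not the paper's settings:

```python
import numpy as np

def spec_augment(spec, max_freq_mask=8, max_time_mask=10, rng=None):
    """Zero out one random frequency band and one random time band
    of a (freq_bins, time_frames) spectrogram, returning a copy."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = spec.copy()
    n_freq, n_time = out.shape
    f = rng.integers(0, max_freq_mask + 1)   # frequency-mask width (may be 0)
    f0 = rng.integers(0, n_freq - f + 1)     # frequency-mask start
    out[f0:f0 + f, :] = 0.0
    t = rng.integers(0, max_time_mask + 1)   # time-mask width (may be 0)
    t0 = rng.integers(0, n_time - t + 1)     # time-mask start
    out[:, t0:t0 + t] = 0.0
    return out

spec = np.ones((80, 100))   # toy 80-bin, 100-frame "spectrogram"
aug = spec_augment(spec)
print(spec.sum(), aug.sum())  # masked copy has at most the original total
```

The full method also includes time warping and can apply multiple masks per axis; this sketch shows only the basic masking idea.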

Automatic Speech Recognition

Big Bidirectional Insertion Representations for Documents

no code implementations WS 2019 Lala Li, William Chan

The Insertion Transformer is well suited for long form text generation due to its parallel generation capabilities, requiring $O(\log_2 n)$ generation steps to generate $n$ tokens.
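The $O(\log_2 n)$ step count can be sanity-checked with a small simulation of a balanced binary insertion order, where each round inserts the middle token of every remaining gap in parallel. This is a toy illustration of the generation-order arithmetic, not the model itself:

```python
import math

def balanced_insertion_rounds(tokens):
    """Return the tokens inserted in each parallel round when every
    round fills the midpoint of each remaining gap."""
    rounds = []
    spans = [(0, len(tokens))]            # half-open spans still to fill
    while spans:
        inserted, next_spans = [], []
        for lo, hi in spans:
            mid = (lo + hi) // 2
            inserted.append(tokens[mid])  # one insertion per gap, in parallel
            if lo < mid:
                next_spans.append((lo, mid))
            if mid + 1 < hi:
                next_spans.append((mid + 1, hi))
        rounds.append(inserted)
        spans = next_spans
    return rounds

seq = list("abcdefg")                     # n = 7 tokens
rounds = balanced_insertion_rounds(seq)
print(len(rounds))                        # 3 == ceil(log2(n + 1))
```

With n = 7 the simulation finishes in 3 rounds, matching ceil(log2(n + 1)).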

Text Generation Translation

Multichannel Generative Language Models

no code implementations 25 Sep 2019 Harris Chan, Jamie Kiros, William Chan

For conditional generation, the model is given a fully observed channel, and generates the remaining k-1 channels in parallel.

Machine Translation

Speaker-Targeted Audio-Visual Models for Speech Recognition in Cocktail-Party Environments

no code implementations 13 Jun 2019 Guan-Lin Chao, William Chan, Ian Lane

Speech recognition in cocktail-party environments remains a significant challenge for state-of-the-art speech recognition systems, as it is extremely difficult to extract an acoustic signal of an individual speaker from a background of overlapping speech with similar frequency and temporal characteristics.

Speech Recognition

KERMIT: Generative Insertion-Based Modeling for Sequences

no code implementations 4 Jun 2019 William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, Jakob Uszkoreit

During training, one can feed KERMIT paired data $(x, y)$ to learn the joint distribution $p(x, y)$, and optionally mix in unpaired data $x$ or $y$ to refine the marginals $p(x)$ or $p(y)$.

Machine Translation Question Answering +2

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

29 code implementations 18 Apr 2019 Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le

On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model.

Automatic Speech Recognition Data Augmentation

Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

3 code implementations 21 Feb 2019 Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob, Bowen Liang, HyoukJoong Lee, Ciprian Chelba, Sébastien Jean, Bo Li, Melvin Johnson, Rohan Anil, Rajat Tibrewal, Xiaobing Liu, Akiko Eriguchi, Navdeep Jaitly, Naveen Ari, Colin Cherry, Parisa Haghani, Otavio Good, Youlong Cheng, Raziel Alvarez, Isaac Caswell, Wei-Ning Hsu, Zongheng Yang, Kuan-Chieh Wang, Ekaterina Gonina, Katrin Tomanek, Ben Vanik, Zelin Wu, Llion Jones, Mike Schuster, Yanping Huang, Dehao Chen, Kazuki Irie, George Foster, John Richardson, Klaus Macherey, Antoine Bruguier, Heiga Zen, Colin Raffel, Shankar Kumar, Kanishka Rao, David Rybach, Matthew Murray, Vijayaditya Peddinti, Maxim Krikun, Michiel A. U. Bacchiani, Thomas B. Jablin, Rob Suderman, Ian Williams, Benjamin Lee, Deepti Bhatia, Justin Carlson, Semih Yavuz, Yu Zhang, Ian McGraw, Max Galkin, Qi Ge, Golan Pundak, Chad Whipkey, Todd Wang, Uri Alon, Dmitry Lepikhin, Ye Tian, Sara Sabour, William Chan, Shubham Toshniwal, Baohua Liao, Michael Nirschl, Pat Rondon

Lingvo is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus on sequence-to-sequence models.

Sequence-To-Sequence Speech Recognition

Insertion Transformer: Flexible Sequence Generation via Insertion Operations

no code implementations 8 Feb 2019 Mitchell Stern, William Chan, Jamie Kiros, Jakob Uszkoreit

We present the Insertion Transformer, an iterative, partially autoregressive model for sequence generation based on insertion operations.

Machine Translation Translation

Privacy Partitioning: Protecting User Data During the Deep Learning Inference Phase

no code implementations 7 Dec 2018 Jianfeng Chi, Emmanuel Owusu, Xuwang Yin, Tong Yu, William Chan, Patrick Tague, Yuan Tian

We present a practical method for protecting data during the inference phase of deep learning based on bipartite topology threat modeling and an interactive adversarial deep network construction.

Face Identification General Classification

Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes

no code implementations 22 Nov 2018 Bo Li, Yu Zhang, Tara Sainath, Yonghui Wu, William Chan

We present two end-to-end models: Audio-to-Byte (A2B) and Byte-to-Audio (B2A), for multilingual speech recognition and synthesis.

Speech Recognition Speech Synthesis

Optimal Completion Distillation for Sequence Learning

2 code implementations ICLR 2019 Sara Sabour, William Chan, Mohammad Norouzi

We present Optimal Completion Distillation (OCD), a training procedure for optimizing sequence to sequence models based on edit distance.

Speech Recognition

Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search

no code implementations ACL 2018 Jamie Kiros, William Chan, Geoffrey Hinton

We introduce Picturebook, a large-scale lookup operation to ground language via 'snapshots' of our physical world accessed through image search.

General Classification Image Retrieval +7

Very Deep Convolutional Networks for End-to-End Speech Recognition

2 code implementations 10 Oct 2016 Yu Zhang, William Chan, Navdeep Jaitly

Sequence-to-sequence models have shown success in end-to-end speech recognition.

Speech Recognition

Latent Sequence Decompositions

no code implementations 10 Oct 2016 William Chan, Yu Zhang, Quoc Le, Navdeep Jaitly

We present the Latent Sequence Decompositions (LSD) framework.

Speech Recognition

Listen, Attend and Spell

40 code implementations 5 Aug 2015 William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals

Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly.

Speech Recognition

Deep Recurrent Neural Networks for Acoustic Modelling

no code implementations 7 Apr 2015 William Chan, Ian Lane

We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR).

Acoustic Modelling Automatic Speech Recognition

Transferring Knowledge from a RNN to a DNN

no code implementations 7 Apr 2015 William Chan, Nan Rosemary Ke, Ian Lane

The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task, compared to a baseline of 4.54 WER, a relative improvement of more than 13%.
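The relative-improvement figure can be checked with simple arithmetic on the WER numbers quoted above:

```python
baseline_wer = 4.54   # baseline DNN, WSJ eval92
small_dnn_wer = 3.93  # small DNN trained on soft RNN alignments

# Relative improvement = (baseline - new) / baseline
rel = (baseline_wer - small_dnn_wer) / baseline_wer
print(f"{rel:.1%}")   # prints 13.4%, i.e. "more than 13%" as stated
```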

Automatic Speech Recognition
