Given two images depicting a person and a garment worn by another person, our goal is to generate a visualization of how the garment might look on the input person.
Deep generative models are becoming increasingly powerful, now generating diverse, high-fidelity, photorealistic samples from text prompts.
Diffusion models have shown promising results on single-image super-resolution and other image-to-image translation tasks.
In the text-only domain, we find that character-aware models provide large gains on a novel spelling task (WikiSpell).
no code implementations • Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Pellegrini, Yasumasa Onoe, Sarah Laszlo, David J. Fleet, Radu Soricut, Jason Baldridge, Mohammad Norouzi, Peter Anderson, William Chan
Through extensive human evaluation on EditBench, we find that object masking during training improves text-image alignment across the board, such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion. As a cohort, these models are better at object rendering than text rendering, and handle material, color, and size attributes better than count and shape attributes.
no code implementations • 5 Oct 2022 • Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, Tim Salimans
We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models.
Ranked #1 on Video Generation on LAION-400M
To further evaluate the capabilities of the model, we introduce EntityDrawBench, a new benchmark that evaluates image generation for diverse entities, from frequent to rare, across multiple object categories including dogs, foods, landmarks, birds, and characters.
Ranked #3 on Text-to-Image Generation on MS COCO
4 code implementations • 23 May 2022 • Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, Mohammad Norouzi
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
Ranked #17 on Text-to-Image Generation on MS COCO (using extra training data)
Unlike existing techniques, we train a stochastic sampler that refines the output of a deterministic predictor and is capable of producing a diverse set of plausible reconstructions for a given input.
We expect this standardized evaluation protocol to play a role in advancing image-to-image translation research.
Ranked #1 on Colorization on ImageNet ctest10k
We show that cascaded diffusion models are capable of generating high fidelity images on the class-conditional ImageNet generation benchmark, without any assistance from auxiliary image classifiers to boost sample quality.
Ranked #3 on Image Generation on ImageNet 64x64
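The cascading idea above can be sketched as a base sample grown through successive super-resolution stages. This is a toy illustration, not the real system: `toy_stage` is a hypothetical stand-in for a full conditional diffusion chain, and the resolutions are arbitrary.

```python
import numpy as np

def upsample2x(img):
    # Nearest-neighbour 2x upsampling via a Kronecker product.
    return np.kron(img, np.ones((2, 2)))

def toy_stage(img_cond, rng):
    # Stand-in for one super-resolution diffusion stage; the real model
    # runs a full denoising chain conditioned on the lower-res sample.
    return img_cond + 0.01 * rng.standard_normal(img_cond.shape)

def cascaded_sample(base_res=8, n_stages=2, seed=0):
    """Sample a low-res image, then grow it through super-res stages."""
    rng = np.random.default_rng(seed)
    img = rng.standard_normal((base_res, base_res))  # base model output
    for _ in range(n_stages):  # e.g. 8x8 -> 16x16 -> 32x32
        img = toy_stage(upsample2x(img), rng)
    return img

out = cascaded_sample()
print(out.shape)  # (32, 32)
```

The design point is that no stage needs an auxiliary classifier: sample quality at each resolution comes entirely from conditioning on the previous stage's output.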
We present SR3, an approach to image Super-Resolution via Repeated Refinement.
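The repeated-refinement loop can be sketched as follows. This is a minimal, hypothetical illustration: `toy_denoiser` stands in for the learned U-Net, and the update rule is chosen only so the toy converges, not to match the paper's schedule.

```python
import numpy as np

def toy_denoiser(y_t, x_lowres, t):
    # Stand-in for the learned denoiser: nudge the current estimate
    # toward the upsampled conditioning image. Hypothetical.
    return y_t + 0.5 * (x_lowres - y_t)

def sr3_style_refinement(x_lowres, steps=10, seed=0):
    """Iteratively refine pure noise, conditioned on a low-res input.

    A minimal sketch of repeated refinement: start from Gaussian noise
    y_T and apply the denoiser for T steps, each step moving the sample
    closer to a plausible high-resolution output.
    """
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(x_lowres.shape)  # y_T ~ N(0, I)
    for t in reversed(range(steps)):
        y = toy_denoiser(y, x_lowres, t)
    return y

# Usage: the conditioning image, already upsampled to target resolution.
x = np.ones((8, 8))
out = sr3_style_refinement(x)
print(out.shape)  # (8, 8)
```

Because sampling starts from fresh noise each time, different seeds yield different plausible reconstructions of the same input, which is the property the stochastic-sampler line above emphasizes.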
In addition, we adapt the Imputer model for non-autoregressive machine translation and demonstrate that Imputer with just 4 generation steps can match the performance of an autoregressive Transformer baseline.
This paper presents the Imputer, a neural sequence model that generates output sequences iteratively via imputations.
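The constant-step decoding idea can be sketched with a toy imputation loop. Everything here is illustrative: `toy_predict` is a hypothetical stand-in for the model (it trivially copies from a reference sequence), and the commit order is left-to-right rather than confidence-based.

```python
MASK = "_"

def toy_predict(partial, source):
    # Stand-in for the learned model: propose a token for every masked
    # slot (here, trivially copied from the source). Hypothetical.
    return [s if p == MASK else p for p, s in zip(partial, source)]

def imputer_decode(source, steps=4):
    """Generate a whole sequence in a fixed number of passes.

    A minimal sketch of block-wise imputation: each iteration commits
    roughly 1/steps of the still-masked positions, so decoding finishes
    in `steps` passes regardless of sequence length.
    """
    out = [MASK] * len(source)
    per_step = -(-len(source) // steps)  # ceil division
    for _ in range(steps):
        proposal = toy_predict(out, source)
        masked = [i for i, tok in enumerate(out) if tok == MASK]
        for i in masked[:per_step]:  # commit one block of positions
            out[i] = proposal[i]
    return out

result = imputer_decode(list("hello world"), steps=4)
print("".join(result))  # hello world
```

The contrast with an autoregressive Transformer is that the number of model calls is a fixed constant (here 4) rather than one per output token.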
In adversarial imitation learning, a discriminator is trained to differentiate agent episodes from expert demonstrations representing the desired behavior.
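The discriminator objective described above can be written down concretely. This sketch assumes a hypothetical linear discriminator over state features, purely for illustration; real systems use a neural network and alternate discriminator updates with policy updates.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(w, expert, agent):
    """Binary cross-entropy for a linear discriminator D(s) = sigmoid(w.s).

    Expert transitions are labelled 1 and agent transitions 0; training
    D to separate them yields a learned reward (e.g. log D(s)) that the
    imitating policy then maximizes.
    """
    d_exp = sigmoid(expert @ w)   # should approach 1
    d_agt = sigmoid(agent @ w)    # should approach 0
    return -(np.log(d_exp).mean() + np.log(1.0 - d_agt).mean())

rng = np.random.default_rng(0)
expert = rng.standard_normal((5, 3)) + 1.0  # toy expert features
agent = rng.standard_normal((5, 3)) - 1.0   # toy agent features
loss0 = discriminator_loss(np.zeros(3), expert, agent)
print(loss0)  # 2*log(2) ~ 1.386 for an uninformative discriminator
```

At `w = 0` the discriminator outputs 0.5 everywhere, giving the chance-level loss; minimizing the loss in `w` is what makes the two distributions distinguishable, and hence provides the agent's training signal.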
Allowing humans to interactively train artificial agents to understand language instructions is desirable for both practical and scientific reasons, but given the poor data efficiency of the current learning methods, this goal may require substantial research efforts.