Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets.
Ranked #1 on Image Generation on ImageNet 512x512
Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models.
Ranked #19 on Text-to-Image Generation on MS COCO
2 code implementations • 2 Nov 2022 • Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, Ming-Yu Liu
Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages; a minimal routing sketch follows this entry.
Ranked #14 on Text-to-Image Generation on MS COCO
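A minimal sketch of the routing idea in PyTorch: assume `experts` is a list of denoisers, each trained on a sub-interval of the noise-level range. The interval edges, tensor shapes, and denoiser signature here are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def ensemble_denoise(experts, boundaries, x_t, t, text_emb):
    """Route a denoising step to the expert that owns noise level t.

    experts:    list of denoiser networks, one per synthesis stage
    boundaries: sorted 1-D tensor of interval edges in [0, 1],
                with len(experts) + 1 entries
    t:          scalar tensor noise level in [0, 1]
    """
    # Find which noise-level interval t falls into, then dispatch.
    idx = torch.bucketize(t, boundaries[1:-1]).item()
    return experts[idx](x_t, t, text_emb)
```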
We introduce the problem of disentangling time-lapse sequences in a way that allows separate, after-the-fact control of overall trends, cyclic effects, and random effects in the images, and describe a technique based on data-driven generative models that achieves this goal.
Existing video generation methods often fail to produce new content as a function of time while maintaining consistencies expected in real environments, such as plausible dynamics and object persistence.
We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices.
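As one concrete instance of such a separated design space, here is a hedged PyTorch sketch of a deterministic Heun sampler over a rho-spaced noise schedule; it assumes a pretrained denoiser `D(x, sigma)` and uses commonly cited default constants, while the function names are ours.

```python
import torch

def edm_sigmas(n=18, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    # Noise-level schedule: interpolate linearly in sigma^(1/rho) space.
    i = torch.arange(n)
    sigmas = (sigma_max ** (1 / rho)
              + i / (n - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return torch.cat([sigmas, torch.zeros(1)])  # final step lands exactly at 0

@torch.no_grad()
def edm_sample(D, x, sigmas):
    # Heun (2nd-order) integration of dx/dsigma = (x - D(x, sigma)) / sigma.
    # x should be initialized as sigmas[0] * torch.randn(shape).
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        d_cur = (x - D(x, s_cur)) / s_cur
        x_euler = x + (s_next - s_cur) * d_cur
        if s_next > 0:
            # Second-order correction using the slope at the next noise level.
            d_next = (x_euler - D(x_euler, s_next)) / s_next
            x = x + (s_next - s_cur) * 0.5 * (d_cur + d_next)
        else:
            x = x_euler  # last step is plain Euler to sigma = 0
    return x
```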
We observe that despite their hierarchical convolutional nature, the synthesis process of typical generative adversarial networks depends on absolute pixel coordinates in an unhealthy manner; a simple equivariance probe is sketched after this entry.
Ranked #1 on Image Generation on FFHQ-U
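The coordinate-dependence claim can be probed with a simple translation-equivariance test. This sketch is our own simplified illustration, assuming a fully convolutional generator `g` that consumes a spatial input; it is not the paper's evaluation protocol.

```python
import torch

def translation_equivariance_error(g, z, shift=8):
    """Compare g(shifted input) against shifted g(input).

    A perfectly shift-equivariant generator gives zero error; dependence
    on absolute pixel coordinates shows up as a large residual. Circular
    shifts (torch.roll) are used for simplicity, ignoring border effects.
    """
    out = g(z)
    out_of_shifted = g(torch.roll(z, shifts=shift, dims=-1))
    shifted_out = torch.roll(out, shifts=shift, dims=-1)
    return (out_of_shifted - shifted_out).abs().mean().item()
```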
We present a modular differentiable renderer design that yields performance superior to previous methods by leveraging existing, highly optimized hardware graphics pipelines.
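A minimal usage sketch of such a modular pipeline, written against nvdiffrast's PyTorch interface (rasterize, interpolate, antialias); the tensor shapes and the `render` helper are illustrative assumptions.

```python
import torch
import nvdiffrast.torch as dr

glctx = dr.RasterizeCudaContext()

# pos: [1, V, 4] clip-space vertex positions, tri: [T, 3] int32 indices,
# col: [1, V, 3] per-vertex colors (all CUDA tensors).
def render(pos, tri, col, res=512):
    rast, _ = dr.rasterize(glctx, pos, tri, resolution=[res, res])
    out, _ = dr.interpolate(col, rast, tri)   # barycentric interpolation
    out = dr.antialias(out, rast, pos, tri)   # analytic edge antialiasing
    return out                                # [1, res, res, 3], differentiable
```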
We also find that the widely used CIFAR-10 is, in fact, a limited-data benchmark, and improve the record FID from 5.59 to 2.42; a sketch of the adaptive-augmentation heuristic follows this entry.
Ranked #1 on Conditional Image Generation on ArtBench-10 (32x32)
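The adaptive-augmentation heuristic referenced above, sketched under assumptions: `d_real` holds raw discriminator outputs on real images, and the target value and step size stand in for the schedule actually used.

```python
import torch

def update_aug_prob(p, d_real, target=0.6, step=1e-3):
    """Adjust the augmentation probability p from an overfitting heuristic.

    r_t = E[sign(D(real))] drifts toward 1 as the discriminator overfits;
    raise p when r_t exceeds the target, lower it otherwise.
    """
    r_t = torch.sign(d_real).mean().item()
    p = p + step * (1.0 if r_t > target else -1.0)
    return min(max(p, 0.0), 1.0)  # keep p in [0, 1]
```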
Overall, our improved model redefines the state of the art in unconditional image modeling, both in terms of existing distribution-quality metrics and in terms of perceived image quality.
Ranked #1 on Image Generation on LSUN Car 256x256
Consistency regularization describes a class of approaches that have yielded groundbreaking results in semi-supervised classification problems.
We analyze the problem of semantic segmentation and find that its distribution does not exhibit low-density regions separating classes, and we offer this as an explanation for why semi-supervised segmentation is challenging, with only a few reported successes.
Unsupervised image-to-image translation methods learn to map an image in a given class to an analogous image in a different class, drawing on unstructured (non-registered) datasets of images.
The ability to automatically estimate the quality and coverage of the samples produced by a generative model is a vital requirement for driving algorithm research.
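A minimal sketch of the k-nearest-neighbour manifold estimate that underlies such sample-quality metrics, assuming `real` and `fake` are feature matrices from a pretrained embedder; the helper name and choice of k are ours.

```python
import torch

def knn_precision(real, fake, k=3):
    """Fraction of fake samples falling inside the estimated real manifold.

    The manifold is the union of balls around each real feature, each with
    radius equal to the distance to its k-th nearest real neighbour.
    Swapping the roles of real and fake gives the recall estimate.
    """
    d_rr = torch.cdist(real, real)              # [N, N] pairwise distances
    radii = d_rr.kthvalue(k + 1, dim=1).values  # k+1 skips the zero self-distance
    d_fr = torch.cdist(fake, real)              # [M, N]
    inside = (d_fr <= radii.unsqueeze(0)).any(dim=1)
    return inside.float().mean().item()
```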
We describe a novel method for training high-quality image denoising models based on unorganized collections of corrupted images.
We propose an alternative generator architecture for generative adversarial networks, borrowing from the style transfer literature; a sketch of the style-modulation idea follows this entry.
Ranked #1 on Image Generation on LSUN Bedroom
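The style-modulation sketch referenced above: an AdaIN-style block in which a latent code drives per-channel scale and bias of normalized feature maps. This is our simplification of the idea, not the full generator.

```python
import torch
import torch.nn as nn

class StyleMod(nn.Module):
    """Modulate normalized feature maps with a per-channel style."""
    def __init__(self, latent_dim, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)
        self.affine = nn.Linear(latent_dim, channels * 2)  # scale and bias

    def forward(self, x, w):
        # x: [B, C, H, W] feature maps, w: [B, latent_dim] style code.
        scale, bias = self.affine(w).chunk(2, dim=1)       # [B, C] each
        x = self.norm(x)
        return x * (1 + scale[:, :, None, None]) + bias[:, :, None, None]
```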
We apply basic statistical reasoning to signal reconstruction by machine learning -- learning to map corrupted observations to clean signals -- with a simple and powerful conclusion: it is possible to learn to restore images by looking only at corrupted examples, with performance at, and sometimes exceeding, that of training on clean data, without explicit image priors or likelihood models of the corruption.
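The conclusion reduces to a one-line training objective: regress one corrupted observation onto another, independently corrupted observation of the same signal. A minimal sketch, assuming such noisy pairs are available; names are ours.

```python
import torch
import torch.nn.functional as F

def noise2noise_step(model, optimizer, noisy_in, noisy_target):
    """One training step: map one noisy realization to another.

    With zero-mean noise and an L2 loss, the minimizer matches the one
    obtained by regressing to the clean signal, so clean targets are
    never needed.
    """
    optimizer.zero_grad()
    loss = F.mse_loss(model(noisy_in), noisy_target)
    loss.backward()
    optimizer.step()
    return loss.item()
```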
We describe a new training methodology for generative adversarial networks.
Ranked #4 on Image Generation on LSUN Horse 256x256 (Clean-FID (trainfull) metric)
Our deep neural network learns a mapping from input waveforms to the 3D vertex coordinates of a face model, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.
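A minimal sketch of that mapping, assuming precomputed per-frame audio features; the layer sizes and the concatenated latent code are our simplification of the described design.

```python
import torch
import torch.nn as nn

class AudioToFace(nn.Module):
    """Map an audio feature window plus a latent code to 3D face vertices."""
    def __init__(self, audio_dim, latent_dim, n_vertices):
        super().__init__()
        self.n_vertices = n_vertices
        self.net = nn.Sequential(
            nn.Linear(audio_dim + latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_vertices * 3),
        )

    def forward(self, audio_feat, latent):
        # The latent code absorbs variation (e.g. expression) that the
        # audio cannot explain; it is learned jointly with the weights.
        out = self.net(torch.cat([audio_feat, latent], dim=-1))
        return out.view(-1, self.n_vertices, 3)
```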
We propose a new criterion based on Taylor expansion that approximates the change in the cost function induced by pruning network parameters.
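A sketch of such a first-order Taylor criterion for channel pruning, assuming `act` holds a layer's activations and `grad` the cost gradient with respect to them (captured, e.g., with hooks); the normalization is illustrative.

```python
import torch

def taylor_importance(act, grad):
    """First-order Taylor estimate of the cost change from zeroing a channel.

    act, grad: [B, C, H, W]. Removing channel c changes the cost by roughly
    |mean over batch/space of activation * gradient| for that channel;
    channels with the smallest scores are pruned first.
    """
    scores = (act * grad).mean(dim=(0, 2, 3)).abs()  # [C]
    return scores / (scores.norm() + 1e-8)           # scale-invariant ranking
```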
In this paper, we present a simple and efficient method for training deep neural networks in a semi-supervised setting where only a small portion of training data is labeled.
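A minimal sketch of a consistency-based loss in this spirit, assuming a stochastic `augment` function; the consistency weight and the choice to treat one pass as a fixed target are illustrative.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, x_lab, y_lab, x_unlab, augment, w=10.0):
    """Supervised loss on labeled data + consistency loss on unlabeled data.

    Two stochastic augmentations of the same unlabeled input should yield
    the same prediction; the squared difference penalizes disagreement.
    """
    sup = F.cross_entropy(model(augment(x_lab)), y_lab)
    p1 = F.softmax(model(augment(x_unlab)), dim=1)
    p2 = F.softmax(model(augment(x_unlab)), dim=1).detach()  # target branch
    return sup + w * F.mse_loss(p1, p2)
```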
We present a real-time deep learning framework for video-based facial performance capture -- the dense 3D tracking of an actor's face given a monocular video.