Despite much progress in training AI systems to imitate human language, building agents that use language to communicate intentionally with humans in interactive environments remains a major challenge.
Language-guided image editing has recently achieved considerable success.
In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model.
We present SinDiffusion, which leverages denoising diffusion models to capture the internal distribution of patches in a single natural image.
Ranked #1 on Image Generation on Places50
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data.
Ranked #1 on Object Detection on LVIS v1.0 val (using extra training data)
In this paper, we cast blind face restoration as a code prediction task and demonstrate that a learned discrete codebook prior in a small proxy space largely reduces the uncertainty and ambiguity of the restoration mapping, while providing rich visual atoms for generating high-quality faces.
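The core of such a codebook prior is vector quantization: each continuous feature vector is replaced by its nearest entry in a small learned codebook, so restoration reduces to predicting discrete code indices. A minimal sketch of that lookup step, with an illustrative random codebook standing in for a learned one (all names and sizes here are assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a learned codebook: 1024 codes, each a 256-dim "visual atom".
codebook = rng.normal(size=(1024, 256))

def quantize(features: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Map each feature vector to its nearest codebook entry.

    features: (n, 256) array of encoder outputs.
    Returns (indices, quantized), where quantized[i] == codebook[indices[i]].
    """
    # Squared Euclidean distance from every feature to every code.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d.argmin(axis=1)
    return indices, codebook[indices]

feats = rng.normal(size=(4, 256))
idx, quant = quantize(feats)
print(idx.shape, quant.shape)  # (4,) (4, 256)
```

Because the output space collapses from a continuum to 1024 discrete atoms, a degraded input only needs to be mapped to the right code indices, which is what makes the prediction task far less ambiguous than direct pixel regression.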
Diffusion models (DMs) are a class of deep generative models that have recently achieved remarkable performance on various image synthesis tasks.
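Diffusion models gradually corrupt data with Gaussian noise and learn to invert that corruption. A minimal sketch of the closed-form forward (noising) process they are trained to reverse; the linear schedule and toy shapes are illustrative assumptions, not tied to any specific paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule (assumed)
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal retention

def noisy_sample(x0: np.ndarray, t: int) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal((8, 8))         # toy "image"
x_mid = noisy_sample(x0, 500)            # partially noised
x_end = noisy_sample(x0, T - 1)          # close to pure Gaussian noise
```

As t approaches T, alpha_bar[t] decays toward zero, so x_t becomes nearly pure noise; a neural network is trained to predict the noise at each step, and sampling runs the chain in reverse to synthesize images.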
Ranked #1 on Video Generation on Taichi