Despite much progress in training AI systems to imitate human language, building agents that use language to communicate intentionally with humans in interactive environments remains a major challenge.
Language-guided image editing has recently achieved notable success.
In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model.
We present SinDiffusion, which leverages denoising diffusion models to capture the internal distribution of patches within a single natural image.
Ranked #1 on Image Generation on Places50
In this paper, we present RenderDiffusion, the first diffusion model for 3D generation and inference that can be trained using only monocular 2D supervision.
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data.
Ranked #1 on Object Detection on LVIS v1.0 val (using extra training data)
Diffusion models (DMs) are a class of deep generative models that have recently achieved remarkable performance on various image synthesis tasks.
Ranked #1 on Video Generation on Taichi
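The diffusion models referenced above share a common forward (noising) process. As a minimal illustrative sketch, assuming a linear beta schedule (the variable names `betas`, `alpha_bars`, and `q_sample` are hypothetical, not taken from any of the papers listed here):

```python
import numpy as np

# Hypothetical minimal sketch of the DDPM-style forward (noising) process;
# a linear beta schedule is assumed for illustration only.
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # per-step noise variances
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative products, one per timestep

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t from q(x_t | x_0) = N(sqrt(a_bar_t) * x_0, (1 - a_bar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

x0 = np.ones((4, 4))        # toy "image"
x_T = q_sample(x0, T - 1)   # at the final step, x_T is close to pure Gaussian noise
```

Training a diffusion model then amounts to learning to reverse this process, predicting the noise added at each step; the generative variants above (image, 3D, video) differ mainly in what the denoiser operates on.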