Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module.
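The attention module referenced here is standard scaled dot-product attention; as a minimal illustrative sketch (NumPy-only, not from any of the papers above), it can be written as:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(QK^T / sqrt(d)) V.

    Q, K, V: (seq_len, d) arrays. This is the core operation the
    Transformer architecture builds on.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise similarities, scaled
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # convex mix of value vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Real implementations add multiple heads, learned projections, and masking, but the quadratic-in-sequence-length score matrix above is the cost that motivates much of the recent architecture work.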
We show that the refined 3D geometric priors enhance the 3D awareness of the 2D diffusion priors, which in turn provide superior guidance for further refining the 3D geometry.
Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity.
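For context, the forward (noising) process that DDMs learn to invert has a simple closed form; a minimal NumPy sketch (illustrative schedule values, not taken from any specific paper here):

```python
import numpy as np

def q_sample(x0, alpha_bar_t, rng):
    # DDPM forward process in closed form:
    #   x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Linear beta schedule over T steps (common illustrative choice, not tuned)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

rng = np.random.default_rng(0)
x0 = rng.standard_normal((16,))               # a toy "image"
x_mid = q_sample(x0, alpha_bar[T // 2], rng)  # partially noised sample
x_T = q_sample(x0, alpha_bar[-1], rng)        # near-pure Gaussian noise
```

A denoising model is then trained to predict the added noise at each step; generation runs this chain in reverse from pure noise.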
Furthermore, we significantly improve the naturalness and speaker similarity of synthesized speech, even in zero-shot speech synthesis scenarios.
We also introduce latent DiffiT, which consists of a transformer model with the proposed self-attention layers, for high-resolution image generation.
Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate videos with this motion.