We also introduce latent DiffiT, which consists of a transformer model with the proposed self-attention layers, for high-resolution image generation.
Ranked #2 on Image Generation on ImageNet 256x256
In detail, we first train an image projection module to connect a vision encoder with an LLM.
We present GauHuman, a 3D human model with Gaussian Splatting for both fast training (1 to 2 minutes) and real-time rendering (up to 189 FPS), in contrast to existing NeRF-based implicit representation modelling frameworks, which demand hours of training and seconds of rendering per frame.
Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity.
Single image depth estimation is a foundational task in computer vision and generative modeling.
Additionally, experiments on 18 datasets further demonstrate that Monkey surpasses existing LMMs in many tasks, such as Image Captioning and various Visual Question Answering formats.
Large language models (LLMs) can potentially democratize access to medical knowledge.
Ranked #1 on Multiple Choice Question Answering (MCQA) on MedMCQA (Dev Set (Acc-%) metric)