We also introduce latent DiffiT, which consists of a transformer model with the proposed self-attention layers, for high-resolution image generation.
Ranked #2 on Image Generation on ImageNet 256x256.
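As a rough illustration of the building block such a diffusion transformer rests on, here is plain scaled dot-product self-attention in numpy. This is a generic sketch: DiffiT's proposed layers additionally condition on the diffusion timestep, which is omitted here, and all dimensions and weight names are illustrative assumptions.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Plain scaled dot-product self-attention over a token sequence.

    Note: this is generic self-attention; DiffiT's time-dependent
    variant (conditioning on the diffusion timestep) is not shown.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
tokens, d = 4, 8                       # illustrative sizes
x = rng.standard_normal((tokens, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```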
In detail, we first train an image projection module to connect a vision encoder with an LLM.
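A minimal sketch of what such an image projection module can look like: a learned linear map from vision-encoder patch features into the LLM's token embedding space. The dimensions (`d_vision`, `d_llm`, `n_patches`) and the single-linear-layer design are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed, not from the paper):
d_vision, d_llm, n_patches = 1024, 4096, 256

# Trainable parameters of the projection module.
W = rng.standard_normal((d_vision, d_llm)) * 0.02
b = np.zeros(d_llm)

def project(patch_features: np.ndarray) -> np.ndarray:
    """Map (n_patches, d_vision) vision features to (n_patches, d_llm)
    pseudo-tokens the LLM can consume alongside text embeddings."""
    return patch_features @ W + b

vision_feats = rng.standard_normal((n_patches, d_vision))
llm_tokens = project(vision_feats)
print(llm_tokens.shape)  # (256, 4096)
```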
To address this issue, we propose Gaussian Grouping, which extends Gaussian Splatting to jointly reconstruct and segment anything in open-world 3D scenes.
What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages?
Tasks: Automatic Speech Recognition, Speech-to-Speech Translation, +3 more.
The RETVec embedding model is pre-trained using pair-wise metric learning to be robust against typos and character-level adversarial attacks.
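Pair-wise metric learning of this kind can be sketched with a standard contrastive margin loss: embeddings of a word and its typo-perturbed variant (a positive pair) are pulled together, while embeddings of unrelated words are pushed apart. The loss and toy vectors below are a generic illustration, not RETVec's actual objective or model.

```python
import numpy as np

def contrastive_loss(a: np.ndarray, b: np.ndarray, is_positive: bool,
                     margin: float = 1.0) -> float:
    """Classic pair-wise contrastive loss (illustrative, not RETVec's)."""
    d = np.linalg.norm(a - b)
    if is_positive:
        return d ** 2                      # same word / typo: minimize distance
    return max(0.0, margin - d) ** 2       # different words: push past margin

emb_word = np.array([0.10, 0.90])
emb_typo = np.array([0.12, 0.88])          # typo variant: nearly identical
emb_other = np.array([0.90, 0.10])         # unrelated word: far away

print(contrastive_loss(emb_word, emb_typo, True))    # small positive-pair loss
print(contrastive_loss(emb_word, emb_other, False))  # 0.0 (already past margin)
```

Training a model under this loss on (word, typo) pairs is what makes the resulting embeddings robust to character-level perturbations.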
We present GauHuman, a 3D human model with Gaussian Splatting that achieves both fast training (1-2 minutes) and real-time rendering (up to 189 FPS), whereas existing NeRF-based implicit representation frameworks demand hours of training and seconds of rendering per frame.
Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity.
Single image depth estimation is a foundational task in computer vision and generative modeling.
Additionally, experiments on 18 datasets further demonstrate that Monkey surpasses existing LMMs in many tasks, such as Image Captioning and various Visual Question Answering formats.
Large language models (LLMs) can potentially democratize access to medical knowledge.
Ranked #1 on Multiple Choice Question Answering (MCQA) on MedMCQA (Dev Set, Acc-% metric).
Tasks: Conditional Text Generation, Multiple Choice Question Answering (MCQA).