We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation.
Additionally, experiments on 18 datasets further demonstrate that Monkey surpasses existing LMMs in many tasks like Image Captioning and various Visual Question Answering formats.
We present, GauHuman, a 3D human model with Gaussian Splatting for both fast training (1 ~ 2 minutes) and real-time rendering (up to 189 FPS), compared with existing NeRF-based implicit representation modelling frameworks demanding hours of training and seconds of rendering per frame.
Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios.
This approach not only enables object pose generation based on arbitrary keypoint definitions but also significantly reduces the associated costs, paving the way for versatile and adaptable pose estimation applications.
Ranked #1 on 2D Pose Estimation on MP-100
To address these challenges, we introduce StyleCrafter, a generic method that enhances pre-trained T2V models with a style control adapter, enabling video generation in any style by providing a reference image.
Besides, although LLMs have shown their pure text-based reasoning ability, it is underexplored whether such ability can be generalized to graph scenarios (i. e., graph-based reasoning).
Large language models (LLMs) can potentially democratize access to medical knowledge.
Ranked #1 on Multiple Choice Question Answering (MCQA) on MedMCQA (Dev Set (Acc-%) metric)
Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity.