In this work, we unify visual representations into the language feature space to advance the foundation LLM towards a unified LVLM.
Ranked #1 on Zero-Shot Video Question Answer on MSVD-QA.
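The unification step described above, mapping visual encoder tokens into the language feature space before the LLM, can be pictured with a minimal PyTorch sketch. The module name, dimensions, and two-layer MLP projector below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class VisualToLanguageProjector(nn.Module):
    """Illustrative sketch: project visual tokens into the LLM embedding
    space and prepend them to the text embeddings as one sequence."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Assumed two-layer MLP projector; dimensions are placeholders.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, n_visual, vision_dim)
        # text_embeds:   (batch, n_text, llm_dim)
        visual_embeds = self.proj(visual_tokens)
        # The concatenated sequence is what the language model consumes.
        return torch.cat([visual_embeds, text_embeds], dim=1)
```

In a sketch like this, sharing one projector across image and video-frame tokens is one simple way to read "a unified visual representation".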
We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation models.
We present a method to create interpretable concept sliders that enable precise control over attributes in image generations from diffusion models.
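One way to picture such a slider is as a low-rank weight offset whose strength is a single scalar chosen at generation time. The sketch below, with hypothetical names (SliderLinear, rank, scale), only illustrates that low-rank control idea and is not the authors' code.

```python
import torch
import torch.nn as nn

class SliderLinear(nn.Module):
    """Wraps a linear layer with a low-rank 'slider' direction:
    y = W x + scale * B (A x), so one scalar dials an attribute up or down."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        out_features, in_features = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.scale = 0.0  # the slider value, set freely at inference time

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

After A and B are trained to move generations along one attribute, sweeping `scale` over a range such as [-2, 2] gives the kind of continuous, interpretable control the abstract describes.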
Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks.
Language models only really need to use an exponentially small fraction of their neurons for individual inferences.
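The claim refers to conditional execution: if a layer's neurons are organized so that each input activates only one small block selected by a shallow decision tree, a forward pass touches a logarithmic rather than linear number of them. The toy layer below illustrates that routing idea; the class name, hard sign-based routing, and sizes are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FastFeedforwardSketch(nn.Module):
    """Toy conditionally-executed feed-forward layer: a depth-d binary tree
    of decision nodes routes each input to exactly one of 2**d small leaf
    MLPs, so only a small fraction of the layer's neurons is evaluated."""

    def __init__(self, dim: int, depth: int, leaf_hidden: int):
        super().__init__()
        self.depth = depth
        n_nodes, n_leaves = 2**depth - 1, 2**depth
        self.node_w = nn.Parameter(torch.randn(n_nodes, dim) * dim**-0.5)
        self.w1 = nn.Parameter(torch.randn(n_leaves, dim, leaf_hidden) * dim**-0.5)
        self.w2 = nn.Parameter(torch.randn(n_leaves, leaf_hidden, dim) * leaf_hidden**-0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim). Descend the tree with hard left/right decisions.
        node = torch.zeros(x.shape[0], dtype=torch.long, device=x.device)
        for _ in range(self.depth):
            score = (x * self.node_w[node]).sum(-1)
            node = 2 * node + 1 + (score > 0).long()   # step to left or right child
        leaf = node - (2**self.depth - 1)              # index among the leaves
        # Evaluate only the selected small leaf MLP for each input.
        h = torch.einsum("bd,bdh->bh", x, self.w1[leaf]).relu()
        return torch.einsum("bh,bhd->bd", h, self.w2[leaf])

# Each input evaluates `depth` scalar gates plus one 32-unit leaf,
# instead of a dense multi-thousand-unit feed-forward block.
layer = FastFeedforwardSketch(dim=768, depth=6, leaf_hidden=32)
out = layer(torch.randn(4, 768))
```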
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
Latent Consistency Models (LCMs) have achieved impressive performance in accelerating text-to-image generative tasks, producing high-quality images with minimal inference steps.
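As a usage illustration only: with the Hugging Face diffusers integration, an LCM scheduler plus an LCM-LoRA adapter lets a standard text-to-image pipeline sample in a handful of steps. The checkpoint names and step/guidance settings below describe one typical setup and are not values taken from the abstract.

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

# Assumed checkpoints: a standard SDXL base plus the public LCM-LoRA adapter.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

# Few-step sampling with low guidance is the typical LCM setting.
image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
image.save("lighthouse.png")
```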
Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios.
In this work, we propose MagicDance, a diffusion-based model for 2D human motion and facial expression transfer on challenging human dance videos.
We introduce CogVLM, a powerful open-source visual language foundation model.
Ranked #3 on Visual Question Answering (VQA) on CORE-MM.