In this work, we unify visual representations into the language feature space to advance the foundational LLM towards a unified large vision-language model (LVLM).
Ranked #1 on Zero-Shot Video Question Answering on TGIF-QA.
We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation models.
We present a method to create interpretable concept sliders that enable precise control over attributes in images generated by diffusion models.
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
Latent Consistency Models (LCMs) have achieved impressive performance in accelerating text-to-image generative tasks, producing high-quality images with minimal inference steps.
Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios.
In this work, we propose MagicDance, a diffusion-based model for 2D human motion and facial expression transfer on challenging human dance videos.