Recent large-scale text-to-speech (TTS) systems are usually grouped into autoregressive and non-autoregressive systems.
To facilitate the scale-up of Emilia, we also present Emilia-Pipe, the first open-source preprocessing pipeline designed to efficiently transform raw, in-the-wild speech data into high-quality training data with speech annotations.
It can understand visual, auditory, and textual modalities, directly output audio, and support flexible duplex interaction.
The recently developed retrieval-augmented generation (RAG) technology has enabled the efficient construction of domain-specific applications.
Our second contribution, DreamClear, is a DiT-based image restoration model.
When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors.
Ranked #1 on Real-Time Object Detection on MS COCO (using extra training data)
The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network.
In this work, we introduce OmniGen, a new diffusion model for unified image generation.
Given a lightweight LLM, our LongVU also scales effectively to a smaller size while retaining state-of-the-art video understanding performance.