In response, we introduce TorchFX: a GPU-accelerated Python library for DSP, specifically engineered to facilitate sophisticated audio signal processing.
Recently, large language model (LLM) based text-to-speech (TTS) systems have gradually become the mainstream in the industry due to their high naturalness and powerful zero-shot voice cloning capabilities. Here, we introduce the IndexTTS system, which is mainly based on the XTTS and Tortoise model.
We introduce Zep, a novel memory layer service for AI agents that outperforms the current state-of-the-art system, MemGPT, in the Deep Memory Retrieval (DMR) benchmark.
We present PixelFlow, a family of image generation models that operate directly in the raw pixel space, in contrast to the predominant latent-space models.
Ranked #36 on
Image Generation
on ImageNet 256x256
Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation.
The refined monodepth is in turn guides stereo effectively at ill-posed regions.
3D Gaussian Splatting (3DGS) enables efficient reconstruction and high-fidelity real-time rendering of complex scenes on consumer hardware.
Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX.
Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively.
We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views.