Efforts to further unify generation and editing tasks have yielded significant progress in image content creation.
Recently, large language model (LLM) based text-to-speech (TTS) systems have gradually become the mainstream in the industry due to their high naturalness and powerful zero-shot voice cloning capabilities. Here, we introduce the IndexTTS system, which is mainly based on the XTTS and Tortoise models.
To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based multi-agent systems (MAS).
Large Language Models (LLMs) have demonstrated effectiveness in code generation tasks.
Urban roads and infrastructure, vital to city operations, face growing threats from subsurface anomalies like cracks and cavities.
Deep learning architectures for supervised learning on tabular data range from simple multilayer perceptrons (MLPs) to sophisticated Transformers and retrieval-augmented methods.
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration. Motion dynamics are compromised to enhance temporal visual quality, video duration is constrained (5 to 10 seconds) to prioritize resolution, and shot-aware generation remains inadequate because general-purpose MLLMs cannot interpret cinematic grammar such as shot composition, actor expressions, and camera motions.
Reliable long-term forecasting of Earth system dynamics is heavily hampered by the instabilities that current AI models exhibit during extended autoregressive simulations.
We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas.
First, $\textbf{Urban1960SatBench}$ serves as a novel, expertly annotated semantic segmentation dataset built on mid-20$^{th}$ century Keyhole imagery, covering 1,240 km$^2$ and key urban classes (buildings, roads, farmland, water).