Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

sparkaudio/spark-tts 3 Mar 2025

Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis.

Attribute Text to Speech +1

4,085
7.64 stars / hour

YOLOE: Real-Time Seeing Anything

THU-MIG/yoloe 10 Mar 2025

Object detection and segmentation are widely employed in computer vision applications, yet conventional models like the YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios.

10-shot image generation

449
3.41 stars / hour

Scaling Synthetic Data Creation with 1,000,000,000 Personas

lightaime/camel 28 Jun 2024

We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data.

Language Modeling Language Modelling +3

10,135
3.11 stars / hour

VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control

TencentARC/VideoPainter 7 Mar 2025

Video inpainting, which aims to restore corrupted video content, has experienced substantial progress.

Image Inpainting Optical Flow Estimation +3

189
2.12 stars / hour

Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

bytedance/flux 27 Feb 2025

The inter-device communication of a MoE layer can occupy 47% of the entire model execution time with popular models and frameworks.

Computational Efficiency

725
1.48 stars / hour

GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control

nv-tlabs/GEN3C 5 Mar 2025

Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video.

Novel View Synthesis Video Generation

377
1.34 stars / hour

Visual-RFT: Visual Reinforcement Fine-Tuning

liuziyu77/visual-rft 3 Mar 2025

Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications where fine-tuning data is scarce.

Few-Shot Object Detection Fine-Grained Image Classification +4

1,216
1.26 stars / hour

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

opendrivelab/agibot-world 9 Mar 2025

Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios, we achieve an order-of-magnitude increase in data scale compared to existing datasets.

1,738
1.14 stars / hour

FoundationStereo: Zero-Shot Stereo Matching

NVlabs/FoundationStereo 17 Jan 2025

However, achieving strong zero-shot generalization - a hallmark of foundation models in other computer vision tasks - remains challenging for stereo matching.

Diversity Stereo Depth Estimation +2

785
1.07 stars / hour

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

dvlab-research/Seg-Zero 9 Mar 2025

Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting their out-of-domain generalization and lacking explicit reasoning processes.

Domain Generalization Open Vocabulary Object Detection +6

176
0.98 stars / hour