Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

sparkaudio/spark-tts 3 Mar 2025

Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis.

Attribute, Text to Speech +1

4,575
2.89 stars / hour

YOLOE: Real-Time Seeing Anything

THU-MIG/yoloe 10 Mar 2025

Object detection and segmentation are widely employed in computer vision applications, yet conventional models like the YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios.

10-shot image generation

691
2.33 stars / hour

Neural Fields with Thermal Activations for Arbitrary-Scale Super-Resolution

prs-eth/thera 29 Nov 2023

We present a novel way to design neural fields such that points can be queried with an adaptive Gaussian PSF, so as to guarantee correct anti-aliasing at any desired output resolution.

Image Super-Resolution

222
1.71 stars / hour

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

kuleshov-group/bd3lms 12 Mar 2025

Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation.

Denoising, Language Modeling +1

311
1.70 stars / hour

Agent S: An Open Agentic Framework that Uses Computers Like a Human

simular-ai/agent-s 10 Oct 2024

We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automating complex, multi-step tasks.

AI Agent, Task Planning

1,200
1.63 stars / hour

LBM: Latent Bridge Matching for Fast Image-to-Image Translation

gojasper/lbm 10 Mar 2025

In this paper, we introduce Latent Bridge Matching (LBM), a new, versatile and scalable method that relies on Bridge Matching in a latent space to achieve fast image-to-image translation.

Depth Estimation, Image Relighting +2

166
1.25 stars / hour

FoundationStereo: Zero-Shot Stereo Matching

NVlabs/FoundationStereo 17 Jan 2025

However, achieving strong zero-shot generalization - a hallmark of foundation models in other computer vision tasks - remains challenging for stereo matching.

Diversity, Stereo Depth Estimation +2

935
1.15 stars / hour

Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k

hpcaitech/open-sora 12 Mar 2025

With this model, we demonstrate that the cost of training a top-performing video generation model is highly controllable.

Video Generation

25,137
1.00 stars / hour

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

osilly/vision-r1 9 Mar 2025

However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data.

Math, Multimodal Reasoning +1

284
0.99 stars / hour

Slim attention: cut your context memory in half without loss of accuracy -- K-cache is all you need for MHA

openmachine-ai/transformer-tricks 7 Mar 2025

For encoder-decoder transformers, the context memory size can be reduced even further: for the Whisper models, for example, slim attention reduces the context memory by 8x, which can speed up token generation by 5x at batch size 64.

All, Decoder

107
0.72 stars / hour