Embedding Atlas: Low-Friction, Interactive Embedding Visualization

apple/embedding-atlas 9 May 2025

Embedding projections are popular for visualizing large datasets and models.

Friction

286
0.55 stars / hour

Practical Efficiency of Muon for Pretraining

KellerJordan/Muon 4 May 2025

We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff.

1,176
0.51 stars / hour

XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

bytedance/xverse 26 Jun 2025

Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs).

Attribute, Scene Generation, +2

528
0.46 stars / hour

R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration

zefan-cai/r-kv 30 May 2025

We propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method that specifically targets redundant tokens in reasoning models.

Mathematical Reasoning

1,082
0.43 stars / hour

SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models

xid32/soundmind 15 Jun 2025

While large language models have shown reasoning capabilities, their application to the audio modality, particularly in large audio-language models (ALMs), remains significantly underdeveloped.

Logical Reasoning, Reinforcement Learning (RL)

758
0.42 stars / hour

MinerU: An Open-Source Solution for Precise Document Content Extraction

opendatalab/mineru 27 Sep 2024

Document content analysis has been a crucial research area in computer vision.

Diversity, Optical Character Recognition (OCR)

39,992
0.42 stars / hour

TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis

aaronz345/tcsinger2 20 May 2025

We introduce TCSinger 2, a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts.

Contrastive Learning, Singing Voice Synthesis, +1

95
0.41 stars / hour

Energy-Based Transformers are Scalable Learners and Thinkers

alexiglad/EBT 2 Jul 2025

We find that Energy-Based Transformers (EBTs) achieve better results than existing models on most downstream tasks given the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches.

Image Denoising, Math

294
0.39 stars / hour

Cautious Optimizers: Improving Training with One Line of Code

kyleliang919/c-optim 25 Nov 2024

AdamW has been the default optimizer for transformer pretraining.

321
0.37 stars / hour

$τ^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

sierra-research/tau2-bench 9 Jun 2025

Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider.

AI Agent

100
0.37 stars / hour