MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm

yuliang-liu/monkeyocr 5 Jun 2025

We introduce MonkeyOCR, a vision-language model for document parsing that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm.

GPU, Relation, +1 more

5,031
0.50 stars / hour

Practical Efficiency of Muon for Pretraining

KellerJordan/Muon 4 May 2025

We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff.

1,201
0.47 stars / hour

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

NVIDIA/audio-flamingo 2 Feb 2024

Augmenting large language models (LLMs) to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs.

Acoustic Scene Classification, Few-Shot Learning, +6 more

604
0.47 stars / hour

TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis

aaronz345/tcsinger2 20 May 2025

To overcome these challenges, we introduce TCSinger 2, a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts.

Contrastive Learning, Singing Voice Synthesis, +1 more

105
0.44 stars / hour

TradingAgents: Multi-Agents LLM Financial Trading Framework

tauricresearch/tradingagents 28 Dec 2024

Significant progress has been made in automated problem-solving using societies of agents powered by large language models (LLMs).

Management

15,771
0.43 stars / hour

ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

FunAudioLLM/ThinkSound 26 Jun 2025

While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging.

Audio Generation, Large Language Model, +1 more

813
0.42 stars / hour

SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

HKUDS/SepLLM 16 Dec 2024

This observation suggests that information of the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss.

GSM8K, Language Modeling, +1 more

319
0.40 stars / hour

R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

yfzhang114/r1_reward 5 May 2025

Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks.

Reinforcement Learning (RL)

240
0.39 stars / hour

DiC: Rethinking Conv3x3 Designs in Diffusion Models

yuchuantian/dic CVPR 2025

Diffusion models have shown exceptional performance in visual generation tasks.

Decoder

81
0.36 stars / hour

Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts

lileyang1210/ant 17 Apr 2025

Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts.

Denoising

64
0.35 stars / hour