Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

osilly/vision-r1 9 Mar 2025

However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data.

Tasks: Math, Multimodal Reasoning (+1 more)

Stars: 295 (1.04 stars/hour)

LBM: Latent Bridge Matching for Fast Image-to-Image Translation

gojasper/lbm 10 Mar 2025

In this paper, we introduce Latent Bridge Matching (LBM), a new, versatile and scalable method that relies on Bridge Matching in a latent space to achieve fast image-to-image translation.
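
As a rough illustration of what bridge matching in a latent space looks like, here is a minimal training-loss sketch, assuming a Brownian-bridge interpolant between paired source/target latents; `encoder` and `drift_net` are hypothetical placeholders, and LBM's exact interpolant and objective may differ.

```python
import torch

# Hypothetical placeholders: `encoder` maps images to latents,
# `drift_net(z_t, t)` is the network being trained. The exact
# interpolant and loss used by LBM may differ from this sketch.

def bridge_matching_loss(encoder, drift_net, src_img, tgt_img, sigma=0.1):
    with torch.no_grad():
        z0 = encoder(src_img)   # latent of the source image
        z1 = encoder(tgt_img)   # latent of the paired target image
    t = torch.rand(z0.shape[0], 1, 1, 1, device=z0.device)  # one t per sample
    eps = torch.randn_like(z0)
    # Brownian-bridge sample between the two paired latents
    z_t = (1 - t) * z0 + t * z1 + sigma * torch.sqrt(t * (1 - t)) * eps
    # Regress the drift that points z_t toward the target latent
    target = (z1 - z_t) / (1 - t).clamp(min=1e-4)
    return torch.mean((drift_net(z_t, t.flatten()) - target) ** 2)
```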

Tasks: Depth Estimation, Image Relighting (+2 more)

Stars: 197 (1.01 stars/hour)

Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

qihoo360/light-r1 13 Mar 2025

The Light-R1 series validates training long-COT models from scratch, showcases the art of curating SFT data, and releases SOTA models trained with RL.

Tasks: Math

Stars: 443 (0.88 stars/hour)

Zero-shot Voice Conversion with Diffusion Transformers

Plachtaa/seed-vc 15 Nov 2024

Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of a reference speech from an unseen speaker.

Tasks: In-Context Learning, Voice Conversion

Stars: 1,948 (0.80 stars/hour)

GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

rongyaofang/got 13 Mar 2025

We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images.
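
In spirit, the paradigm is a two-stage pipeline: reason in language first, then render. A hedged sketch, with `mllm` and `image_decoder` as hypothetical stand-ins for the paper's components (GoT's actual conditioning, e.g. grounded layout information, is richer than this):

```python
# Hypothetical interface: `mllm.generate` produces a textual reasoning
# chain and `image_decoder` renders an image conditioned on it.

def generate_with_got(mllm, image_decoder, prompt):
    # Stage 1: explicit language reasoning about objects, layout, edits
    reasoning_chain = mllm.generate(f"Reason step by step before drawing: {prompt}")
    # Stage 2: condition image synthesis on the prompt plus the chain
    return image_decoder(prompt=prompt, reasoning=reasoning_chain)
```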

Tasks: Language Modeling (+3 more)

Stars: 138 (0.79 stars/hour)

Slim attention: cut your context memory in half without loss of accuracy -- K-cache is all you need for MHA

openmachine-ai/transformer-tricks 7 Mar 2025

For encoder-decoder transformers, the context memory size can be reduced even further: for the Whisper models, for example, slim attention reduces the context memory by 8x, which can speed up token generation by 5x at batch size 64.
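
The trick rests on a simple linear-algebra identity: in standard MHA, K = XW_K and V = XW_V with square projection matrices, so V = K(W_K^{-1}W_V) and only the K-cache needs to be stored. A minimal sketch of that identity, assuming W_K is invertible (variable names are ours, not the repo's):

```python
import torch

# The per-model matrix W_KV = W_K^{-1} @ W_V is precomputed once,
# offline. At attention time, V is reconstructed from the cached K.

d = 64
W_K = torch.randn(d, d, dtype=torch.float64)
W_V = torch.randn(d, d, dtype=torch.float64)
W_KV = torch.linalg.solve(W_K, W_V)  # W_K^{-1} @ W_V, computed offline

X = torch.randn(10, d, dtype=torch.float64)  # hidden states of 10 cached tokens
K = X @ W_K                   # the only tensor kept in the cache
V_exact = X @ W_V             # what a standard KV-cache would store
V_from_K = K @ W_KV           # reconstructed on the fly instead
assert torch.allclose(V_exact, V_from_K, atol=1e-6)
```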

Tasks: Decoder

Stars: 128 (0.78 stars/hour)

Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

bytedance/flux 27 Feb 2025

The inter-device communication of a MoE layer can occupy 47% of the entire model execution time with popular models and frameworks.
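
The general idea behind computation-communication overlapping can be illustrated with a chunked dispatch loop, where the all-to-all for the next chunk is in flight while the experts process the current one. This is only an illustrative sketch with a hypothetical `expert_fn`; Comet's actual fine-grained overlap happens at the kernel level, not at this coarse granularity:

```python
import torch
import torch.distributed as dist

def overlapped_moe(expert_fn, chunks, group=None):
    """Pipeline MoE dispatch: prefetch chunk i+1 while computing chunk i."""
    outputs = []
    recv = [torch.empty_like(c) for c in chunks]
    # Kick off the first dispatch asynchronously
    work = dist.all_to_all_single(recv[0], chunks[0], group=group, async_op=True)
    for i in range(len(chunks)):
        work.wait()                        # chunk i has arrived
        if i + 1 < len(chunks):            # issue the next dispatch early
            work = dist.all_to_all_single(recv[i + 1], chunks[i + 1],
                                          group=group, async_op=True)
        outputs.append(expert_fn(recv[i]))  # compute overlaps the transfer
    return torch.cat(outputs)
```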

Tasks: Computational Efficiency

Stars: 775 (0.78 stars/hour)

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

tidedra/lmm-r1 10 Mar 2025

Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures, where architectural constraints limit reasoning capacity and modality alignment.
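
Rule-based RL in the R1 style replaces a learned reward model with verifiable checks on the model's output. A sketch of what such a reward might look like (the concrete rules and weights used by LMM-R1 are assumptions here):

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Format rule plus verifiable-answer rule; weights are illustrative."""
    reward = 0.0
    # Format rule: reasoning must be wrapped in <think> tags
    if re.search(r"<think>.*?</think>", response, re.DOTALL):
        reward += 0.1
    # Accuracy rule: the final boxed answer must match the reference
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match and match.group(1).strip() == gold_answer.strip():
        reward += 1.0
    return reward
```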

Tasks: Logical Reasoning, Multimodal Reasoning (+1 more)

Stars: 587 (0.71 stars/hour)

2 OLMo 2 Furious

allenai/OLMo-core 31 Dec 2024

Our modified model architecture and training recipe achieve both better training stability and improved per-token efficiency.

Stars: 160 (0.71 stars/hour)

HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

dcdmllm/healthgpt 14 Feb 2025

To effectively train HealthGPT, we devise VL-Health, a comprehensive medical domain-specific comprehension and generation dataset.

Tasks: Language Modeling (+1 more)

Stars: 569 (0.67 stars/hour)