YOLOE: Real-Time Seeing Anything

THU-MIG/yoloe 10 Mar 2025

Object detection and segmentation are widely employed in computer vision applications, yet conventional models like the YOLO series, while efficient and accurate, are limited by predefined categories, which hinders adaptability in open scenarios.

10-shot image generation

542
3.69 stars / hour

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

sparkaudio/spark-tts 3 Mar 2025

Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis.

Attribute, Text to Speech +1

4,269
3.07 stars / hour

Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

bytedance/flux 27 Feb 2025

The inter-device communication of a MoE layer can occupy 47% of the entire model execution time with popular models and frameworks.

Computational Efficiency

743
2.37 stars / hour

LBM: Latent Bridge Matching for Fast Image-to-Image Translation

gojasper/lbm 10 Mar 2025

In this paper, we introduce Latent Bridge Matching (LBM), a new, versatile and scalable method that relies on Bridge Matching in a latent space to achieve fast image-to-image translation.

Depth Estimation, Image Relighting +2

103
1.91 stars / hour

FoundationStereo: Zero-Shot Stereo Matching

NVlabs/FoundationStereo 17 Jan 2025

Achieving strong zero-shot generalization, a hallmark of foundation models in other computer vision tasks, remains challenging for stereo matching.

Diversity, Stereo Depth Estimation +2

859
1.78 stars / hour

VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control

TencentARC/VideoPainter 7 Mar 2025

Video inpainting, which aims to restore corrupted video content, has experienced substantial progress.

Image Inpainting, Optical Flow Estimation +3

202
1.65 stars / hour

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

osilly/vision-r1 9 Mar 2025

Direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data.

Math, Multimodal Reasoning +1

256
1.54 stars / hour

Scaling Synthetic Data Creation with 1,000,000,000 Personas

camel-ai/camel 28 Jun 2024

We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data.

Language Modeling, Language Modelling +3

10,242
1.40 stars / hour

Learning Efficient Online 3D Bin Packing on Packing Configuration Trees

alexfrom0815/Online-3D-BPP-PCT ICLR 2022

PCT is a full-fledged description of the state and action space of bin packing which can support packing policy learning based on deep reinforcement learning (DRL).

3D Bin Packing, Deep Reinforcement Learning

455
1.13 stars / hour

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

tidedra/lmm-r1 10 Mar 2025

Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment.

Logical Reasoning, Multimodal Reasoning +1

533
1.07 stars / hour