Absolute Zero: Reinforced Self-play Reasoning with Zero Data

LeapLabTHU/Absolute-Zero-Reasoner 6 May 2025

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards.

Mathematical Reasoning

720 stars
4.12 stars / hour

Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

maitrix-org/voila 5 May 2025

A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner.

AI Agent, Automatic Speech Recognition, +4 more

287 stars
1.82 stars / hour

Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

aidc-ai/awesome-unified-multimodal-models 5 May 2025

Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation.

Survey, Text-to-Image Generation

131 stars
1.79 stars / hour

LTX-Video: Realtime Video Latent Diffusion

Lightricks/LTX-Video 30 Dec 2024

To address this, our VAE decoder is tasked with both latent-to-pixel conversion and the final denoising step, producing the clean result directly in pixel space.

Denoising, Image to Video Generation

4,761 stars
1.52 stars / hour

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

ruc-nlpir/webthinker 30 Apr 2025

Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities.

Navigate

678 stars
1.33 stars / hour

FastVLM: Efficient Vision Encoding for Vision Language Models

apple/ml-fastvlm 17 Dec 2024

At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency.

542 stars
1.28 stars / hour

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

simular-ai/agent-s 1 Apr 2025

Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries.

AI Agent, Task Planning

4,569 stars
1.24 stars / hour

LiftFeat: 3D Geometry-Aware Local Feature Matching

lyp-deeplearning/liftfeat 6 May 2025

We then design a 3D geometry-aware feature lifting module to fuse the surface normal feature with the raw 2D descriptor feature.

3D geometry, Homography Estimation, +3 more

79 stars
1.23 stars / hour

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

facebookresearch/perception_models 17 Apr 2025

In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding.

Video Question Answering, Video Understanding

1,020 stars
0.98 stars / hour

LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

ictnlp/llama-omni2 5 May 2025

Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction.

Chatbot, Decoder, +3 more

144 stars
0.88 stars / hour