FastVLM: Efficient Vision Encoding for Vision Language Models

apple/ml-fastvlm 17 Dec 2024

At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency.

3,719
1.00 stars / hour

Group-in-Group Policy Optimization for LLM Agent Training

langfengq/verl-agent 16 May 2025

In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence.

Mathematical Reasoning, Reinforcement Learning (RL)

112
0.86 stars / hour

VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning

dvlab-research/VisionReasoner 17 May 2025

Large vision-language models exhibit inherent capabilities to handle diverse visual perception tasks.

2D Object Detection, Object Counting, +6

53
0.82 stars / hour

SOAP: Style-Omniscient Animatable Portraits

tingtingliao/soap 8 May 2025

Creating animatable 3D avatars from a single image remains challenging due to style limitations (realistic, cartoon, anime) and difficulties in handling accessories or hairstyles.

Image to 3D

171
0.81 stars / hour

MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation

dingyanb/mtvcrafter 15 May 2025

To tackle this problem, we propose MTVCrafter (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for human image animation.

Image Animation, Video Generation

52
0.77 stars / hour

AdaptThink: Reasoning Models Can Learn When to Think

thu-keg/adaptthink 19 May 2025

Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking.

Math

33
0.74 stars / hour

Spherical Channels for Modeling Atomic Interactions

Open-Catalyst-Project/ocp 29 Jun 2022

We propose the Spherical Channel Network (SCN) to model atomic energies and forces.

10-shot image generation, Computational chemistry, +1

1,398
0.69 stars / hour

Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers

oliverrensu/grat 20 May 2025

We validate GRAT on pretrained Flux and HunyuanVideo for image and video generation, respectively.

Video Generation

23
0.67 stars / hour

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

multimodal-art-projection/korgym 20 May 2025

Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities.

Reinforcement Learning

18
0.67 stars / hour

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

LeapLabTHU/Absolute-Zero-Reasoner 6 May 2025

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards.

Mathematical Reasoning

1,272
0.62 stars / hour