FastVLM: Efficient Vision Encoding for Vision Language Models

apple/ml-fastvlm 17 Dec 2024

At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency.

1,804
7.28 stars / hour

Continuous Thought Machines

SakanaAI/continuous-thought-machines 8 May 2025

The CTM has two core innovations: (1) neuron-level temporal processing, where each neuron uses unique weight parameters to process a history of incoming signals; and (2) neural synchronization employed as a latent representation.

Computational Efficiency, Question Answering

498
6.29 stars / hour

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

LeapLabTHU/Absolute-Zero-Reasoner 6 May 2025

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards.

Mathematical Reasoning

807
3.07 stars / hour

Generating Physically Stable and Buildable LEGO Designs from Text

AvaLovelace1/LegoGPT 8 May 2025

Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts.

3D Generation, Large Language Model +1

887
2.71 stars / hour

Flow-GRPO: Training Flow Matching Models via Online RL

yifan123/flow_grpo 8 May 2025

We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models.

Denoising, Diversity +3

405
2.07 stars / hour

LTX-Video: Realtime Video Latent Diffusion

Lightricks/LTX-Video 30 Dec 2024

To address this, our VAE decoder is tasked with both latent-to-pixel conversion and the final denoising step, producing the clean result directly in pixel space.

Denoising, Image to Video Generation

5,324
1.61 stars / hour

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

hitsz-tmg/awesome-large-multimodal-reasoning-models 8 May 2025

Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning.

Multimodal Reasoning

238
1.42 stars / hour

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

opendrivelab/univla 9 May 2025

Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding.

Vision-Language-Action

143
1.29 stars / hour

OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation

OpenHelix-robot/OpenHelix 6 May 2025

Dual-system VLA (Vision-Language-Action) architectures have become a hot topic in embodied intelligence research, but there is a lack of sufficient open-source work for further performance analysis and optimization.

Vision-Language-Action

118
1.22 stars / hour

Unified Continuous Generative Models

LINs-Lab/UCGM 12 May 2025

We introduce a unified framework for training, sampling, and analyzing these models.

Image Generation

42
1.11 stars / hour