Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

bytedance/dolphin 20 May 2025

Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables.

1,882
2.10 stars / hour

TradingAgents: Multi-Agents LLM Financial Trading Framework

tauricresearch/tradingagents 28 Dec 2024

Significant progress has been made in automated problem-solving using societies of agents powered by large language models (LLMs).

Management

4,353
1.67 stars / hour

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

facebookresearch/vjepa2 11 Jun 2025

Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset.

Action Anticipation, Large Language Model +3

1,265
1.60 stars / hour

Towards CausalGPT: A Multi-Agent Approach for Faithful Knowledge Reasoning via Promoting Causal Consistency in LLMs

hcplab-sysu/causal-vlreasoning 23 Aug 2023

Drawing inspiration from the orchestration of diverse specialized agents collaborating to tackle intricate tasks, we propose a framework named Causal-Consistency Chain-of-Thought (CaCo-CoT) that harnesses multi-agent collaboration to bolster the faithfulness and causality of foundation models, involving a set of reasoners and evaluators.

counterfactual, Science Question Answering

443
1.12 stars / hour

MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments

hcplab-sysu/causalvlr 1 Feb 2024

To overcome this limitation, we introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.

Embodied Question Answering, Language Modeling +4

442
1.11 stars / hour

MUSt3R: Multi-view Network for Stereo 3D Reconstruction

naver/must3r CVPR 2025

DUSt3R introduced a novel paradigm in geometric computer vision by proposing a model that can provide dense and unconstrained Stereo 3D Reconstruction of arbitrary image collections with no prior information about camera calibration nor viewpoint poses.

3D Reconstruction, Articles +3

120
0.92 stars / hour

VGGT: Visual Geometry Grounded Transformer

facebookresearch/vggt CVPR 2025

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views.

Depth Estimation, Novel View Synthesis +3

8,246
0.89 stars / hour

Ming-Omni: A Unified Multimodal Model for Perception and Generation

inclusionai/ming 11 Jun 2025

We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation.

Image Generation, text-to-speech +1

290
0.78 stars / hour

MagCache: Fast Video Generation with Magnitude-Aware Cache

Zehong-Ma/ComfyUI-MagCache 10 Jun 2025

Existing acceleration techniques for video diffusion models often rely on uniform heuristics or time-embedding variants to skip timesteps and reuse cached features.

SSIM, Video Generation

110
0.75 stars / hour

R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration

zefan-cai/r-kv 30 May 2025

To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models.

Mathematical Reasoning

430
0.74 stars / hour