Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

bytedance/dolphin 20 May 2025

Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables.

2,009
1.72 stars / hour

TradingAgents: Multi-Agents LLM Financial Trading Framework

tauricresearch/tradingagents 28 Dec 2024

Significant progress has been made in automated problem-solving using societies of agents powered by large language models (LLMs).

Management

4,620
1.70 stars / hour

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

facebookresearch/vjepa2 11 Jun 2025

Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset.

Action Anticipation, Large Language Model, +3

1,435
1.53 stars / hour

Efficient Part-level 3D Object Generation via Dual Volume Packing

nvlabs/partpacker 11 Jun 2025

Recent progress in 3D object generation has greatly improved both the quality and efficiency.

Diversity, Object

308
1.13 stars / hour

Towards CausalGPT: A Multi-Agent Approach for Faithful Knowledge Reasoning via Promoting Causal Consistency in LLMs

hcplab-sysu/causal-vlreasoning 23 Aug 2023

Drawing inspiration from the orchestration of diverse specialized agents collaborating to tackle intricate tasks, we propose a framework named Causal-Consistency Chain-of-Thought (CaCo-CoT) that harnesses multi-agent collaboration to bolster the faithfulness and causality of foundation models, involving a set of reasoners and evaluators.

counterfactual, Science Question Answering

500
1.08 stars / hour

MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments

hcplab-sysu/causalvlr 1 Feb 2024

To overcome this limitation, we introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.

Embodied Question Answering, Language Modeling, +4

499
1.07 stars / hour

VGGT: Visual Geometry Grounded Transformer

facebookresearch/vggt CVPR 2025

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views.

Depth Estimation, Novel View Synthesis, +3

8,413
0.89 stars / hour

Ming-Omni: A Unified Multimodal Model for Perception and Generation

inclusionai/ming 11 Jun 2025

We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation.

Image Generation, text-to-speech, +1

324
0.82 stars / hour

R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration

zefan-cai/r-kv 30 May 2025

To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models.

Mathematical Reasoning

495
0.75 stars / hour

OPT-BENCH: Evaluating LLM Agent on Large-Scale Search Spaces Optimization Problems

oliverleexz/opt-bench 12 Jun 2025

Large Language Models (LLMs) have shown remarkable capabilities in solving diverse tasks.

110
0.72 stars / hour