Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

bytedance/dolphin 20 May 2025

Document image parsing is challenging because elements such as text paragraphs, figures, formulas, and tables are complexly intertwined within a page.

2,940
2.23 stars / hour

Mirage: A Multi-Level Superoptimizer for Tensor Programs

mirage-project/mirage 9 May 2024

We introduce Mirage, the first multi-level superoptimizer for tensor programs.

1,230
1.81 stars / hour

LeVo: High-Quality Song Generation with Multi-Preference Alignment

tencent-ailab/songgeneration 9 Jun 2025

To further enhance musicality and instruction following, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO).

Instruction Following, Music Generation

288
1.34 stars / hour
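
For readers unfamiliar with DPO, here is a minimal sketch of the standard pairwise DPO objective that LeVo's multi-preference alignment builds on. The function signature and beta value are illustrative; how LeVo combines multiple preference dimensions is not specified in the excerpt.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer the chosen
    sample over the rejected one, relative to a frozen reference model.
    All inputs are summed sequence log-probs of shape (batch,)."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin); minimized when the preference margin is large.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Presumably the multi-preference variant aggregates such pairwise losses across several preference axes (e.g., musicality and instruction following); the weighting is the paper's own detail.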

Do Large Language Models Need a Content Delivery Network?

lmcache/lmcache 16 Sep 2024

As the use of large language models (LLMs) expands rapidly, so does the range of knowledge needed to supplement various LLM queries.

In-Context Learning

1,623
1.23 stars / hour

PixelsDB: Serverless and NL-Aided Data Analytics with Flexible Service Levels and Prices

pixelsdb/pixels 30 May 2024

The queries are then executed by a serverless query engine that offers varying prices for different performance service-level agreements (SLAs).

Scheduling

409
1.23 stars / hour
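
To make the flexible price/service-level idea concrete, here is a toy sketch. The tier names, latencies, and prices are invented for illustration and are not PixelsDB's actual pricing model.

```python
# Hypothetical SLA tiers for a serverless query engine: the caller
# trades price against a latency target. All values are illustrative.
SLA_TIERS = {
    "economy":  {"max_latency_s": 60.0, "price_per_query": 0.01},
    "standard": {"max_latency_s": 10.0, "price_per_query": 0.05},
    "premium":  {"max_latency_s": 2.0,  "price_per_query": 0.20},
}

def cheapest_tier(latency_budget_s: float) -> str:
    """Pick the cheapest tier whose latency target fits the budget."""
    feasible = {name: t for name, t in SLA_TIERS.items()
                if t["max_latency_s"] <= latency_budget_s}
    if not feasible:
        raise ValueError("no tier meets the latency budget")
    return min(feasible, key=lambda n: feasible[n]["price_per_query"])
```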

TradingAgents: Multi-Agents LLM Financial Trading Framework

tauricresearch/tradingagents 28 Dec 2024

Significant progress has been made in automated problem-solving using societies of agents powered by large language models (LLMs).

Management

4,791
1.14 stars / hour

MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments

hcplab-sysu/causalvlr 1 Feb 2024

To overcome this limitation, we introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.

Embodied Question Answering, Language Modeling, +4

608
1.03 stars / hour
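
The MEIA excerpt describes the now-common pattern of using a language model as a planner that maps a natural-language task to executable actions. A schematic sketch of that pattern, assuming a hypothetical `llm_call` function and an invented action schema (neither is MEIA's actual interface):

```python
import json

# Invented primitive-action vocabulary for illustration.
ACTIONS = {"navigate_to", "pick_up", "place", "open", "speak"}

def plan_actions(task: str, llm_call) -> list[dict]:
    """Ask an LLM to translate a high-level task into a JSON list of
    primitive actions, then keep only actions the agent can execute.
    `llm_call` stands in for any text-completion function."""
    prompt = (
        "Translate the task into a JSON list of steps, each "
        '{"action": ..., "target": ...}. Task: ' + task
    )
    steps = json.loads(llm_call(prompt))
    return [s for s in steps if s.get("action") in ACTIONS]
```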

Visual Causal Scene Refinement for Video Question Answering

hcplab-sysu/causal-vlreasoning 7 May 2023

Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner.

Contrastive Learning, Question Answering, +2

608
1.02 stars / hour
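
The CSS module's "contrastive learning manner" can be sketched with a generic InfoNCE-style loss that scores causal scenes above non-causal ones for answer prediction. The names and temperature below are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def scene_contrastive_loss(causal_logits, noncausal_logits, tau=0.07):
    """InfoNCE-style loss: treat the causal scene's answer score as the
    positive and non-causal scenes' scores as negatives, so the model
    learns to rely on causal scenes.
    causal_logits: (batch,); noncausal_logits: (batch, num_negatives)."""
    logits = torch.cat(
        [causal_logits.unsqueeze(1), noncausal_logits], dim=1) / tau
    # The positive sits at index 0 of every row.
    targets = torch.zeros(
        logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```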

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

facebookresearch/vjepa2 11 Jun 2025

Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the DROID dataset.

Action Anticipation, Large Language Model, +3

1,476
0.94 stars / hour
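
The latent action-conditioned world model mentioned above can be caricatured as a dynamics predictor over encoder latents. A schematic PyTorch module, with dimensions and architecture invented for illustration (not V-JEPA 2-AC's actual design):

```python
import torch
import torch.nn as nn

class LatentActionWorldModel(nn.Module):
    """Schematic action-conditioned dynamics model: predict the next
    latent state from the current latent and an action."""
    def __init__(self, latent_dim=256, action_dim=7):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 512),
            nn.GELU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, z_t, a_t):
        # z_t: (batch, latent_dim) from a frozen video encoder;
        # a_t: (batch, action_dim), e.g. an end-effector command.
        return self.dynamics(torch.cat([z_t, a_t], dim=-1))
```

Planning then amounts to rolling such a predictor forward under candidate action sequences and choosing the sequence whose predicted latents best match a goal embedding.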

Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

ictnlp/stream-omni 16 Jun 2025

The emergence of GPT-4o-like large multimodal models (LMMs) has spurred exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction.

Large Language Model, Multimodal Interaction

133
0.83 stars / hour