Document image parsing is challenging due to its complex, intertwined elements such as text paragraphs, figures, formulas, and tables.
We introduce Mirage, the first multi-level superoptimizer for tensor programs.
To further enhance musicality and instruction following, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO).
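For context, a minimal sketch of the standard (single-pair) DPO objective that such alignment methods build on is shown below; this is the textbook DPO loss, not the paper's multi-preference variant, and the function name, the default beta, and the toy log-probabilities are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: prefer the chosen response over the rejected
    one, measured relative to a frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid of the reward margin; minimized when the policy assigns
    # a higher relative likelihood to the chosen response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with scalar sequence log-probabilities for a single preference pair.
loss = dpo_loss(torch.tensor(-12.0), torch.tensor(-15.0),
                torch.tensor(-13.0), torch.tensor(-14.0))
```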
As the use of large language models (LLMs) expands rapidly, so does the range of knowledge needed to supplement various LLM queries.
The queries are then executed by a serverless query engine that offers varying prices for different performance service-level agreements (SLAs).
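To make the price-versus-SLA trade-off concrete, here is a purely illustrative sketch of picking the cheapest tier that meets a latency budget; the tier names, prices, latencies, and helper names are hypothetical and are not taken from the paper's engine or pricing model.

```python
from dataclasses import dataclass

@dataclass
class SlaTier:
    name: str
    price_per_query: float   # e.g. USD per query
    p95_latency_s: float     # promised 95th-percentile latency

def cheapest_tier(tiers, latency_budget_s):
    """Return the lowest-priced tier whose promised latency meets the budget."""
    feasible = [t for t in tiers if t.p95_latency_s <= latency_budget_s]
    return min(feasible, key=lambda t: t.price_per_query) if feasible else None

tiers = [SlaTier("economy", 0.002, 30.0),
         SlaTier("standard", 0.01, 5.0),
         SlaTier("premium", 0.05, 1.0)]
print(cheapest_tier(tiers, latency_budget_s=10.0).name)  # -> "standard"
```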
Significant progress has been made in automated problem-solving using societies of agents powered by large language models (LLMs).
To overcome this limitation, we introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.
Our VCSR involves two essential modules. The Question-Guided Refiner (QGR) module refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention. The Causal Scene Separator (CSS) module discovers a collection of visual causal and non-causal scenes based on visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention via contrastive learning.
Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset.
The emergence of GPT-4o-like large multimodal models (LMMs) has spurred exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction.