We introduce MonkeyOCR, a vision-language model for document parsing that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm.
We instantiate this framework in WebDancer, a web agent built on the ReAct paradigm.
While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging.
As the use of large language models (LLMs) expands rapidly, so does the range of external knowledge needed to answer the queries posed to them.
Large language model (LLM) based text-to-speech (TTS) systems have recently become mainstream in industry owing to their high naturalness and powerful zero-shot voice-cloning capabilities. Here, we introduce the IndexTTS system, which is based mainly on the XTTS and Tortoise models.
This paper aims to achieve universal segmentation at arbitrary semantic levels.
Ranked #1 on Referring Expression Segmentation on RefCOCOg-test (using extra training data).
Significant progress has been made in automated problem-solving using societies of agents powered by large language models (LLMs).
To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules.
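The fused LoRA and MoE design described above can be illustrated with a minimal sketch: a frozen base weight is adapted by several low-rank (LoRA) experts whose updates are blended by a learned gate. All names, shapes, and the gating scheme here are illustrative assumptions, not the actual SMoEStereo implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MoELoRALayer:
    """Hypothetical sketch of a mixture of LoRA experts over a frozen weight.

    Each expert i contributes a low-rank update B_i @ A_i; a gate computed
    from the input weights the experts. Shapes and the gating rule are
    assumptions for illustration only.
    """

    def __init__(self, d_in, d_out, rank=4, n_experts=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in))        # frozen base weight
        self.A = rng.standard_normal((n_experts, rank, d_in)) * 0.01
        self.B = np.zeros((n_experts, d_out, rank))        # LoRA init: B = 0
        self.gate = rng.standard_normal((n_experts, d_in)) * 0.01

    def __call__(self, x):
        w = softmax(self.gate @ x)                         # per-input expert weights
        delta = sum(wi * (Bi @ (Ai @ x))                   # blended low-rank updates
                    for wi, Ai, Bi in zip(w, self.A, self.B))
        return self.W @ x + delta
```

With the standard LoRA initialization (B = 0), the layer initially reproduces the frozen base mapping exactly, so adaptation starts from the pretrained behavior.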
Document image parsing is challenging because elements such as text paragraphs, figures, formulas, and tables are complexly intertwined.