MinerU: An Open-Source Solution for Precise Document Content Extraction

opendatalab/mineru 27 Sep 2024

Document content analysis has been a crucial research area in computer vision.

Diversity, Optical Character Recognition (OCR)

19,583
0.57 stars / hour

ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

idea-research/chatrex 27 Nov 2024

From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset which possesses multiple granularities to support the joint training of perception and understanding.

63
0.56 stars / hour

WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

pku-yuangroup/wf-vae 26 Nov 2024

However, as the resolution and duration of generated videos increase, the encoding cost of Video VAEs becomes a limiting bottleneck in training LVDMs.

71
0.54 stars / hour

DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving

hustvl/diffusiondrive 22 Nov 2024

However, the numerous denoising steps in the robotic diffusion policy and the more dynamic, open-world nature of traffic scenes pose substantial challenges for generating diverse driving actions at a real-time speed.

Autonomous Driving, Denoising

174
0.52 stars / hour

MagicPIG: LSH Sampling for Efficient LLM Generation

infini-ai-lab/magicpig 21 Oct 2024

MagicPIG stores the LSH hash tables and runs the attention computation on the CPU, which allows it to serve longer contexts and larger batch sizes with high approximation accuracy.

122
0.52 stars / hour

Multi-Programming Language Sandbox for LLMs

Ablustrund/MPLSandbox 30 Oct 2024

We introduce MPLSandbox, an out-of-the-box multi-programming language sandbox designed to provide unified and comprehensive feedback from compiler and analysis tools for Large Language Models (LLMs).

208
0.49 stars / hour

Docling Technical Report

DS4SD/docling 19 Aug 2024

This technical report introduces Docling, an easy-to-use, self-contained, MIT-licensed open-source package for PDF document conversion.

11,865
0.47 stars / hour

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

akariasai/openscholar 21 Nov 2024

Scientific progress depends on researchers' ability to synthesize the growing body of literature.

Retrieval

405
0.46 stars / hour

Visually Guided Generative Text-Layout Pre-training for Document Intelligence

veason-silverbullet/vitlp 25 Mar 2024

Prior studies show that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to perceive and reason over both document texts and layouts (e.g., locations of texts and table cells).

Document Classification, Document Understanding

89
0.45 stars / hour

Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance

thunlp/proactiveagent 16 Oct 2024

The labeled data is used to train a reward model that simulates human judgment and serves as an automatic evaluator of the proactiveness of LLM agents.

66
0.44 stars / hour