mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

x-plug/mplug-docowl 5 Sep 2024

Multimodal Large Language Models (MLLMs) have achieved promising OCR-free Document Understanding performance by increasing the supported resolution of document images.

document understanding Optical Character Recognition (OCR) +1

1,866
0.46 stars / hour

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

mark12ding/sam2long 21 Oct 2024

Benefiting from its heuristic search design, SAM2Long is robust toward occlusions and object reappearances, and can effectively segment and track objects for complex long-term videos.

Object Segmentation +4

493
0.44 stars / hour

Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models

ictnlp/auto-rag 29 Nov 2024

Iterative retrieval refers to the process in which the model continuously queries the retriever during generation to enhance the relevance of the retrieved knowledge, thereby improving the performance of Retrieval-Augmented Generation (RAG).

Decision Making RAG +1

136
0.44 stars / hour

XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation

lxa9867/imagefolder 2 Dec 2024

Improvements in architecture, quantization techniques, and training recipes have significantly enhanced both image reconstruction and the downstream generation quality.

Image Reconstruction Quantization

140
0.43 stars / hour

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

kvcache-ai/Mooncake 24 Jun 2024

Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs.

2,143
0.42 stars / hour

Occupancy-Based Dual Contouring

kaist-visual-ai-group/odc 20 Sep 2024

Based on Manifold Dual Contouring (MDC), we propose Occupancy-Based Dual Contouring (ODC), which mainly modifies the computation of grid edge points (1D points) and grid cell points (3D points) to not use any distance information.

3D Reconstruction

71
0.42 stars / hour

DF40: Toward Next-Generation Deepfake Detection

YZY-stack/DF40 19 Jun 2024

In this work, we find that the dataset (both training and test) can be the "primary culprit" due to: (1) forgery diversity: the term "deepfake" commonly refers to both face forgery and entire image synthesis.

DeepFake Detection Face Reenactment +2

214
0.40 stars / hour

DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting

mycfhs/dreammix 26 Nov 2024

Extensive experiments demonstrate that DreamMix effectively balances identity preservation and attribute editability across various application scenarios, including object insertion, attribute editing, and small object inpainting.

Attribute Diversity +2

93
0.40 stars / hour

Multimodal Whole Slide Foundation Model for Pathology

mahmoodlab/titan 29 Nov 2024

The field of computational pathology has been transformed by recent advances in foundation models that encode histopathology regions of interest (ROIs) into versatile and transferable feature representations via self-supervised learning (SSL).

Cross-Modal Retrieval Retrieval +2

98
0.40 stars / hour

TEXGen: a Generative Diffusion Model for Mesh Textures

CVMI-Lab/TEXGen 22 Nov 2024

Instead, we focus on the fundamental problem of learning in the UV texture space itself.

Texture Synthesis

162
0.40 stars / hour