Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

kvcache-ai/Mooncake 24 Jun 2024

Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs.

1,923
2.58 stars / hour

EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation

antgroup/echomimic_v2 15 Nov 2024

Recent work on human animation usually involves audio, pose, or movement map conditions, thereby achieving vivid animation quality.

Audio-Driven Body Animation, Human Animation, +1

1,446
1.61 stars / hour

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

showlab/showui 26 Nov 2024

In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI connected graph, adaptively identifying their redundant relationships, and using these as the criteria for token selection in self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data type imbalances.

Instruction Following

328
1.60 stars / hour

Star Attention: Efficient LLM Inference over Long Sequences

NVIDIA/Star-Attention 26 Nov 2024

Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism.

Computational Efficiency

231
1.46 stars / hour

OminiControl: Minimal and Universal Control for Diffusion Transformer

Yuanshi9815/OminiControl 22 Nov 2024

In this paper, we introduce OminiControl, a highly versatile and parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models.

697
1.32 stars / hour

One Diffusion to Generate Them All

lehduong/onediffusion 25 Nov 2024

Experimental results demonstrate competitive performance across both generation and prediction tasks, such as text-to-image, multiview generation, ID preservation, depth estimation, and camera pose estimation, despite a relatively small training dataset.

Camera Pose Estimation, Deblurring, +4

248
1.24 stars / hour

SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

yangchris11/samurai 18 Nov 2024

The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects.

Visual Object Tracking, Visual Tracking

5,421
1.17 stars / hour

Identity-Preserving Text-to-Video Generation by Frequency Decomposition

PKU-YuanGroup/ConsisID 26 Nov 2024

We propose a hierarchical training strategy to leverage frequency information for identity preservation, transforming a vanilla pre-trained video generation model into an identity-preserving text-to-video (IPT2V) model.

Text-to-Video Generation, Video Generation

282
1.13 stars / hour

StableAnimator: High-Quality Identity-Preserving Human Image Animation

Francis-Rings/StableAnimator 26 Nov 2024

During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality.

Denoising, Face Reenactment, +3

144
1.00 stars / hour

Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions

aidc-ai/marco-o1 21 Nov 2024

OpenAI o1 has recently sparked a surge of interest in the study of large reasoning models (LRMs).

Reinforcement Learning (RL)

1,033
0.99 stars / hour