DeMo: Decoupled Momentum Optimization

bloc97/demo 29 Nov 2024

Training large neural networks typically requires sharing gradients between accelerators through specialized high-speed interconnects.

107
2.63 stars / hour

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

kvcache-ai/Mooncake 24 Jun 2024

Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs.

1,923
2.56 stars / hour

MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions

modelscope/ClearerVoice-Studio 23 Feb 2023

To effectively solve the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs a full-computation self-attention on local chunks and a linearised low-cost self-attention over the full sequence.

Ranked #1 on Speech Separation on WSJ0-2mix-16k (using extra training data)

Speech Separation

230
1.41 stars / hour

Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models

ictnlp/auto-rag 29 Nov 2024

Iterative retrieval refers to the process in which the model continuously queries the retriever during generation to enhance the relevance of the retrieved knowledge, thereby improving the performance of Retrieval-Augmented Generation (RAG).

Decision Making, RAG (+1 more)

68
1.25 stars / hour

Identity-Preserving Text-to-Video Generation by Frequency Decomposition

PKU-YuanGroup/ConsisID 26 Nov 2024

We propose a hierarchical training strategy to leverage frequency information for identity preservation, transforming a vanilla pre-trained video generation model into an IPT2V model.

Text-to-Video Generation, Video Generation

350
1.20 stars / hour

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

showlab/showui 26 Nov 2024

In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI connected graph and adaptively identifying redundant relationships among its components, which then serve as the criteria for token selection in self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data type imbalances.

Instruction Following

365
1.17 stars / hour

EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation

antgroup/echomimic_v2 15 Nov 2024

Recent work on human animation usually involves audio, pose, or movement-map conditions, thereby achieving vivid animation quality.

Audio-Driven Body Animation, Human Animation (+1 more)

1,493
1.15 stars / hour

Star Attention: Efficient LLM Inference over Long Sequences

NVIDIA/Star-Attention 26 Nov 2024

Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism.

Computational Efficiency

240
1.12 stars / hour

OminiControl: Minimal and Universal Control for Diffusion Transformer

Yuanshi9815/OminiControl 22 Nov 2024

In this paper, we introduce OminiControl, a highly versatile and parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models.

721
1.10 stars / hour

SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting

lllyasviel/ic-light CVPR 2024

We introduce a co-designed approach for human portrait relighting that combines a physics-guided architecture with a pre-training framework.

6,547
1.08 stars / hour