DeMo: Decoupled Momentum Optimization

bloc97/demo 29 Nov 2024

Training large neural networks typically requires sharing gradients between accelerators through specialized high-speed interconnects.

10-shot image generation 1 Image, 2*2 Stitchi

133
1.53 stars / hour

Scaling Transformers for Low-Bitrate High-Quality Speech Coding

Stability-AI/stable-codec 29 Nov 2024

The tokenization of speech with neural audio codec models is a vital part of modern AI pipelines for the generation or understanding of speech, alone or in a multimodal context.

Quantization

134
1.37 stars / hour

MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions

modelscope/ClearerVoice-Studio 23 Feb 2023

To effectively solve the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs a full-computation self-attention on local chunks and a linearised low-cost self-attention over the full sequence.

Ranked #1 on Speech Separation on WSJ0-2mix-16k (using extra training data)

Speech Separation

399
1.32 stars / hour
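
The MossFormer entry above describes combining full self-attention inside local chunks with a linearised, low-cost self-attention over the whole sequence. The following is a minimal NumPy sketch of that general idea only, not MossFormer's actual gated single-head design: the `elu(x) + 1` feature map, the plain summation of the two branches, and all function names are assumptions made for illustration.

```python
import numpy as np

def local_softmax_attention(q, k, v, chunk):
    """Full (quadratic) softmax attention, computed separately inside each chunk."""
    n, d = q.shape
    out = np.zeros_like(v, dtype=float)
    for start in range(0, n, chunk):
        end = min(start + chunk, n)
        scores = q[start:end] @ k[start:end].T / np.sqrt(d)
        scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)
        out[start:end] = weights @ v[start:end]
    return out

def linear_global_attention(q, k, v):
    """Linearised attention over the full sequence, O(n) in sequence length.

    Assumption: uses the positive feature map elu(x) + 1 in place of softmax.
    """
    def phi(x):
        return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                # (d, d_v) summary of all keys and values
    norm = qf @ kf.sum(axis=0)   # (n,) per-query normaliser
    return (qf @ kv) / norm[:, None]

def joint_attention(q, k, v, chunk=4):
    """Hypothetical combination: sum of the local and global branches."""
    return local_softmax_attention(q, k, v, chunk) + linear_global_attention(q, k, v)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((10, 8)) for _ in range(3))
out = joint_attention(q, k, v, chunk=4)
```

The local branch costs O(n · chunk) and captures fine detail within a chunk, while the global branch never forms the n × n attention matrix, which is what keeps it cheap over long sequences.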

Open-Sora Plan: Open-Source Large Video Generation Model

PKU-YuanGroup/ConsisID 28 Nov 2024

We introduce Open-Sora Plan, an open-source project that aims to contribute a large generation model for generating desired high-resolution videos with long durations based on various user inputs.

Video Generation

429
1.31 stars / hour

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

FoundationVision/VAR 3 Apr 2024

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction".

Image Generation, Language Modelling, +2

5,256
1.28 stars / hour
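
To make the contrast in the VAR entry above concrete, the sketch below counts sequential generation steps under raster-scan next-token prediction versus coarse-to-fine next-scale prediction, where each scale's token map is produced in one step. The scale schedule `(1, 2, 4, 8, 16)` and the function names are hypothetical choices for illustration, not values taken from the paper.

```python
def raster_scan_steps(side: int) -> int:
    """Next-token prediction: one sequential step per token in a side x side map."""
    return side * side

def next_scale_steps(scales: tuple) -> int:
    """Next-scale prediction: one sequential step per scale (whole map at once)."""
    return len(scales)

def next_scale_total_tokens(scales: tuple) -> int:
    """Total tokens produced across all scales, coarse to fine."""
    return sum(s * s for s in scales)

scales = (1, 2, 4, 8, 16)  # hypothetical schedule, coarsest to finest
print(raster_scan_steps(16))            # 256 sequential steps
print(next_scale_steps(scales))         # 5 sequential steps
print(next_scale_total_tokens(scales))  # 341 tokens generated in total
```

Under these assumptions, next-scale prediction emits more tokens overall but needs far fewer sequential steps, since all tokens within a scale are predicted together.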

Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models

ictnlp/auto-rag 29 Nov 2024

Iterative retrieval refers to the process in which the model continuously queries the retriever during generation to enhance the relevance of the retrieved knowledge, thereby improving the performance of Retrieval-Augmented Generation (RAG).

Decision Making, RAG, +1

112
1.09 stars / hour
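
The iterative retrieval loop described in the Auto-RAG entry above can be sketched generically as follows. This is not the Auto-RAG implementation: `CORPUS`, `toy_retriever`, and `toy_model_step` are hypothetical stand-ins for a real document store, retriever, and language model, and the stopping rule (the model emits an answer, or a round limit is reached) is an assumption.

```python
from typing import Callable, List, Tuple

# Hypothetical two-document corpus, for illustration only.
CORPUS = {
    "capital of france": "Paris is the capital of France.",
    "population of paris": "Paris has about 2.1 million residents.",
}

def toy_retriever(query: str) -> str:
    """Hypothetical retriever: exact-match lookup, empty string on a miss."""
    return CORPUS.get(query.lower(), "")

def toy_model_step(question: str, evidence: List[str]) -> Tuple[str, str]:
    """Hypothetical stand-in for the LLM.

    Returns ("query", text) to request more retrieval,
    or ("answer", text) once it judges the evidence sufficient.
    """
    if len(evidence) == 0:
        return ("query", "capital of France")
    if len(evidence) == 1:
        return ("query", "population of Paris")
    return ("answer", " ".join(evidence))

def iterative_rag(
    question: str,
    model_step: Callable[[str, List[str]], Tuple[str, str]],
    retriever: Callable[[str], str],
    max_rounds: int = 5,
) -> str:
    """Alternate between model decisions and retrieval until an answer is produced."""
    evidence: List[str] = []
    for _ in range(max_rounds):
        kind, text = model_step(question, evidence)
        if kind == "answer":
            return text
        doc = retriever(text)
        if doc:
            evidence.append(doc)
    # Round limit reached: fall back to whatever evidence was gathered.
    return " ".join(evidence)

result = iterative_rag(
    "How many people live in the capital of France?",
    toy_model_step,
    toy_retriever,
)
print(result)
```

The key design point is that the model, not a fixed schedule, decides when to issue another query and when to stop, while `max_rounds` bounds the cost if it never stops.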

Multimodal Whole Slide Foundation Model for Pathology

mahmoodlab/titan 29 Nov 2024

The field of computational pathology has been transformed by recent advances in foundation models that encode histopathology regions of interest (ROIs) into versatile and transferable feature representations via self-supervised learning (SSL).

Cross-Modal Retrieval, Retrieval, +2

79
0.95 stars / hour

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

showlab/showui 26 Nov 2024

In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI connected graph, adaptively identifying redundant relationships and using them as the criteria for token selection during self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data type imbalances.

Instruction Following

459
0.90 stars / hour

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

kvcache-ai/Mooncake 24 Jun 2024

Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs.

2,079
0.89 stars / hour
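
The "525% increase in throughput" in the Mooncake entry above is easy to misread as a 5.25x multiplier. Under the conventional reading of a percentage increase relative to the baseline (an assumption, since the snippet does not state the multiplier), the conversion is as follows; the function name is hypothetical.

```python
def increase_to_multiplier(percent_increase: float) -> float:
    """Convert an 'X% increase' into a multiplier of the baseline value."""
    return 1.0 + percent_increase / 100.0

# A 525% increase means the new throughput is 6.25 times the baseline.
print(increase_to_multiplier(525))  # 6.25
```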

TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

YesianRohn/TextSSR 2 Dec 2024

Experiments show that models trained with added TextSSR-F data achieve better accuracy than models trained on 4 million existing synthetic samples.

Image Generation, Scene Text Recognition

51
0.84 stars / hour