MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions

modelscope/ClearerVoice-Studio 23 Feb 2023

To address the indirect element-wise interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs full-computation self-attention on local chunks and linearised, low-cost self-attention over the full sequence.
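The joint local and global attention idea can be illustrated with a minimal pure-Python sketch. It is not the MossFormer implementation: the function names, the softmax used in the local branch, the unnormalised linear form in the global branch, and the plain sum that combines the two are all assumptions made for illustration, and the gating and convolution modules named in the title are omitted.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def local_attention(q, k, v, chunk):
    """Full (quadratic) softmax attention, restricted to non-overlapping chunks."""
    out = []
    for start in range(0, len(q), chunk):
        ks = k[start:start + chunk]
        vs = v[start:start + chunk]
        for qi in q[start:start + chunk]:
            w = softmax([dot(qi, kj) / math.sqrt(len(qi)) for kj in ks])
            out.append([sum(wj * vj[d] for wj, vj in zip(w, vs))
                        for d in range(len(vs[0]))])
    return out

def global_linear_attention(q, k, v):
    """Linearised attention over the full sequence.

    K^T V is accumulated once as a (d_k x d_v) matrix, so the cost grows
    linearly with sequence length instead of quadratically.
    """
    n, d_k, d_v = len(q), len(k[0]), len(v[0])
    kv = [[sum(k[t][i] * v[t][j] for t in range(n)) / n for j in range(d_v)]
          for i in range(d_k)]
    return [[sum(qi[i] * kv[i][j] for i in range(d_k)) for j in range(d_v)]
            for qi in q]

def joint_attention(q, k, v, chunk):
    """Assumed combination: element-wise sum of the local and global branches."""
    loc = local_attention(q, k, v, chunk)
    glo = global_linear_attention(q, k, v)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(loc, glo)]
```

In this sketch the local branch only lets a position see its own chunk, while the global branch lets every position see a cheap summary of the whole sequence, so elements in different chunks interact directly rather than through a second pass.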

Ranked #1 on Speech Separation on WSJ0-2mix-16k (using extra training data)

Speech Separation

1,724
5.35 stars / hour

StableAnimator: High-Quality Identity-Preserving Human Image Animation

Francis-Rings/StableAnimator 26 Nov 2024

During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality.

Denoising Face Reenactment +3

546
1.97 stars / hour

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

kwaivgi/syncammaster 10 Dec 2024

Recent advancements in video diffusion models have shown exceptional abilities in simulating real-world dynamics and maintaining 3D consistency.

4D reconstruction Video Generation

197
1.93 stars / hour

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

cshaitao/awesome-llms-as-judges 7 Dec 2024

Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions.

154
1.66 stars / hour

HunyuanVideo: A Systematic Framework For Large Video Generative Models

tencent/hunyuanvideo 3 Dec 2024

In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models.

Video Alignment Video Generation

5,787
1.55 stars / hour

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

opendatalab/OmniDocBench 10 Dec 2024

Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies.

Attribute Benchmarking +2

110
1.10 stars / hour

SSL4EO-L: Datasets and Foundation Models for Landsat Imagery

microsoft/torchgeo NeurIPS 2023

The Landsat program is the longest-running Earth observation program in history, with 50+ years of data acquisition by 8 satellites.

Cloud Detection Earth Observation +2

3,087
1.09 stars / hour

Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey

imxtx/awesome-controllabe-speech-synthesis 9 Dec 2024

In this paper, we conduct a comprehensive survey of controllable TTS, covering approaches ranging from basic control techniques to methods utilizing natural language prompts, aiming to provide a clear understanding of the current state of research.

Speech Synthesis Survey +1

74
1.07 stars / hour

Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

FoundationVision/Infinity 5 Dec 2024

We present Infinity, a Bitwise Visual AutoRegressive model capable of generating high-resolution, photorealistic images following language instructions.

Image Generation

326
1.07 stars / hour

o1-Coder: an o1 Replication for Coding

adam-bjtu/o1-coder 29 Nov 2024

The technical report introduces O1-CODER, an attempt to replicate OpenAI's o1 model with a focus on coding tasks.

Reinforcement Learning (RL)

219
1.07 stars / hour