Moonshine: Speech Recognition for Live Transcription and Voice Commands

usefulsensors/moonshine 21 Oct 2024

This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing.

Decoder Position +2

1,792
3.49 stars / hour

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

open-mmlab/amphion 1 Sep 2024

The recent large-scale text-to-speech (TTS) systems are usually grouped as autoregressive and non-autoregressive systems.

Self-Supervised Learning Text to Speech

6,758
2.60 stars / hour

Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation

open-mmlab/Amphion 7 Jul 2024

To facilitate the scale-up of Emilia, we also present Emilia-Pipe, the first open-source preprocessing pipeline designed to efficiently transform raw, in-the-wild speech data into high-quality training data with speech annotations.

6,720
2.42 stars / hour

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

gpt-omni/mini-omni2 15 Oct 2024

It can understand visual, auditory, and textual modalities, directly output audio, and support flexible duplex interaction.

Language Modelling

1,299
1.89 stars / hour

KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation

openspg/kag 10 Sep 2024

The recently developed retrieval-augmented generation (RAG) technology has enabled the efficient construction of domain-specific applications.

Knowledge Graphs Question Answering +2

390
1.80 stars / hour

DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation

shallowdream204/dreamclear 24 Oct 2024

Our second contribution, DreamClear, is a DiT-based image restoration model.

Image Restoration

345
1.28 stars / hour

D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement

Peterande/D-FINE 17 Oct 2024

When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors.

Ranked #1 on Real-Time Object Detection on MS COCO (using extra training data)

Real-Time Object Detection regression

429
1.27 stars / hour

Tora: Trajectory-oriented Diffusion Transformer for Video Generation

alibaba/Tora 31 Jul 2024

The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network.

Video Compression Video Generation

484
1.25 stars / hour

OmniGen: Unified Image Generation

vectorspacelab/omnigen 17 Sep 2024

In this work, we introduce OmniGen, a new diffusion model for unified image generation.

Edge Detection Pose Estimation +2

1,275
1.08 stars / hour

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Vision-CAIR/LongVU 22 Oct 2024

Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.

Token Reduction Video Understanding +1

189
1.05 stars / hour