Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

gpt-omni/mini-omni 29 Aug 2024

We also introduce the VoiceAssistant-400K dataset to fine-tune models optimized for speech output.

Speech Synthesis

1,422
5.35 stars / hour

FLUX that Plays Music

feizc/fluxmusic 1 Sep 2024

This paper explores a simple extension of diffusion-based rectified flow Transformers for text-to-music generation, termed FluxMusic.

Music Generation, Text-to-Music Generation

502
2.25 stars / hour

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

jishengpeng/wavtokenizer 29 Aug 2024

Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.

Language Modelling

506
1.71 stars / hour

rerankers: A Lightweight Python Library to Unify Ranking Methods

answerdotai/rerankers 30 Aug 2024

This paper presents rerankers, a Python library which provides an easy-to-use interface to the most commonly used re-ranking approaches.

Re-Ranking, Retrieval

690
1.20 stars / hour

DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification

chongqingnosubway/dgr-mil 4 Jul 2024

Second, we propose two mechanisms to enforce the diversity among the global vectors to be more descriptive of the entire bag: (i) positive instance alignment and (ii) a novel, efficient, and theoretically guaranteed diversification learning paradigm.

Descriptive, Diversity +3

71
0.91 stars / hour

Sapiens: Foundation for Human Vision Models

facebookresearch/sapiens 22 Aug 2024

We present Sapiens, a family of models for four fundamental human-centric vision tasks -- 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction.

2D Pose Estimation, Depth Estimation +3

3,661
0.80 stars / hour

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

freedomintelligence/longllava 4 Sep 2024

Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents.

Video Understanding

57
0.79 stars / hour

NTIRE 2024 Challenge on Low Light Image Enhancement: Methods and Results

caiyuanhao1998/retinexformer 22 Apr 2024

This paper reviews the NTIRE 2024 low light image enhancement challenge, highlighting the proposed solutions and results.

4k, Low-Light Image Enhancement +1

795
0.72 stars / hour

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

nvlabs/eagle 28 Aug 2024

We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies.

Optical Character Recognition

308
0.68 stars / hour

VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

keytoyze/visionts 30 Aug 2024

Surprisingly, without further adaptation in the time-series domain, the proposed VisionTS could achieve superior zero-shot forecasting performance compared to existing TSF foundation models.

Image Reconstruction, Time Series +1

65
0.68 stars / hour