QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

ivcylc/qa-mdt 24 May 2024

In recent years, diffusion-based text-to-music (TTM) generation has gained prominence, offering an innovative approach to synthesizing musical content from textual descriptions.

Diversity Music Generation +1

388
0.26 stars / hour

Revisit Anything: Visual Place Recognition via Image Segment Retrieval

anyloc/revisit-anything 26 Sep 2024

This poses a fundamental challenge in matching two images of the same place captured from different camera viewpoints: "the similarity of what overlaps can be dominated by the dissimilarity of what does not overlap".

Image Segmentation Navigate +3

55
0.26 stars / hour

FAST-LIVO2: Fast, Direct LiDAR-Inertial-Visual Odometry

hku-mars/fast-livo2 26 Aug 2024

The fusion of both visual and LiDAR measurements is based on a single unified voxel map where the LiDAR module constructs the geometric structure for registering new LiDAR scans and the visual module attaches image patches to the LiDAR points.

Visual Odometry

890
0.25 stars / hour

YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

ultralytics/ultralytics 21 Feb 2024

It can be used to obtain complete information, so that train-from-scratch models can achieve better results than state-of-the-art models pre-trained using large datasets; the comparison results are shown in Figure 1.

object-detection Real-Time Object Detection

29,514
0.25 stars / hour

LLaMA-Omni: Seamless Speech Interaction with Large Language Models

ictnlp/llama-omni 10 Sep 2024

We build our model based on the latest Llama-3.1-8B-Instruct model.

2,142
0.25 stars / hour

OverFlow: Putting flows on top of neural transducers for better TTS

coqui-ai/TTS 13 Nov 2022

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech.

Ranked #11 on Text-To-Speech Synthesis on LJSpeech (using extra training data)

Normalising Flows Speech Synthesis +2

34,339
0.24 stars / hour

CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility

zibojia/COCOCO 18 Mar 2024

To this end, this paper proposes a novel text-guided video inpainting model that achieves better consistency, controllability and compatibility.

Image Inpainting Video Alignment +2

254
0.23 stars / hour

StyleBooth: Image Style Editing with Multimodal Instruction

modelscope/scepter 18 Apr 2024

We integrate encoded textual instruction and image exemplar as a unified condition for the diffusion model, enabling the editing of the original image following multimodal instructions.

375
0.23 stars / hour

Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction

salesforceairesearch/gemfilter 25 Sep 2024

Our research introduces a novel approach for the long context bottleneck to accelerate LLM inference and reduce GPU memory consumption.

Token Reduction

50
0.23 stars / hour

SoundStorm: Efficient Parallel Audio Generation

lucidrains/soundstorm-pytorch 16 May 2023

We present SoundStorm, a model for efficient, non-autoregressive audio generation.

Audio Generation

1,342
0.22 stars / hour