VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

vita-mllm/vita-audio 6 May 2025

Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios.

Automatic Speech Recognition (ASR) +7

144
0.86 stars / hour

PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides

icip-cas/pptagent 7 Jan 2025

Automatically generating presentations from documents is a challenging task that requires balancing content quality, visual design, and structural coherence.

1,298
0.73 stars / hour

EdgeTAM: On-Device Track Anything Model

facebookresearch/edgetam 13 Jan 2025

Given that video segmentation is a dense prediction task, we find that preserving the spatial structure of the memories is essential, so the queries are split into global-level and patch-level groups.

model Video Segmentation +1

213
0.71 stars / hour

BayesFlow: Amortized Bayesian Workflows With Neural Networks

stefanradev93/BayesFlow 28 Jun 2023

Modern Bayesian inference involves a mixture of computational techniques for estimating, validating, and drawing conclusions from probabilistic models as part of principled workflows for data analysis.

Bayesian Inference Data Compression

550
0.68 stars / hour

SkyReels-V2: Infinite-length Film Generative Model

skyworkai/skyreels-v2 17 Apr 2025

Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions.

Large Language Model model +2

2,166
0.57 stars / hour

Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D

facebookresearch/locate-3d 19 Apr 2025

LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities.

Decoder Object Localization +1

207
0.57 stars / hour

D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement

Peterande/D-FINE 17 Oct 2024

When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors.

Real-Time Object Detection regression

2,224
0.56 stars / hour

R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

yfzhang114/r1_reward 5 May 2025

Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks.

Reinforcement Learning (RL)

106
0.55 stars / hour

Attentive Reasoning Queries: A Systematic Method for Optimizing Instruction-Following in Large Language Models

emcie-co/parlant 5 Mar 2025

We present Attentive Reasoning Queries (ARQs), a novel structured reasoning approach that significantly improves instruction-following in Large Language Models through domain-specialized reasoning blueprints.

Hallucination Instruction Following +1

2,695
0.55 stars / hour

Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer

Shi-qingyu/DeT 21 Mar 2025

In contrast, state-of-the-art video Diffusion Transformer (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information.

Benchmarking Video Generation

112
0.54 stars / hour