Efficient Part-level 3D Object Generation via Dual Volume Packing

nvlabs/partpacker 11 Jun 2025

Recent progress in 3D object generation has greatly improved both quality and efficiency.

Diversity Object

354
0.81 stars / hour

AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents

hkuds/autoagent 9 Feb 2025

To address this challenge, we introduce AutoAgent, a Fully-Automated and highly Self-Developing framework that enables users to create and deploy LLM agents through Natural Language Alone.

RAG Retrieval-augmented Generation

5,019
0.74 stars / hour

EdgeTAM: On-Device Track Anything Model

facebookresearch/edgetam CVPR 2025

Given that video segmentation is a dense prediction task, we find preserving the spatial structure of the memories is essential so that the queries are split into global-level and patch-level groups.

model Video Segmentation +1

428
0.74 stars / hour

SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models

xid32/soundmind 15 Jun 2025

While large language models have shown reasoning capabilities, their application to the audio modality, particularly in large audio-language models (ALMs), remains significantly underdeveloped.

Logical Reasoning Reinforcement Learning (RL)

179
0.65 stars / hour

MAGREF: Masked Guidance for Any-Reference Video Generation

magref-video/magref 29 May 2025

Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches.

Video Generation

169
0.60 stars / hour

VGGT: Visual Geometry Grounded Transformer

facebookresearch/vggt CVPR 2025

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views.

Depth Estimation Novel View Synthesis +3

8,758
0.58 stars / hour

VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use

VTOOL-R1/vtool-r1 25 May 2025

Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use.

Multimodal Reasoning Question Answering +2

84
0.58 stars / hour

R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration

zefan-cai/r-kv 30 May 2025

To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models.

Mathematical Reasoning

549
0.57 stars / hour

Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation

meigen-ai/multitalk 28 May 2025

Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and videos of appealing visual quality.

Human Animation Instruction Following +1

473
0.56 stars / hour

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

index-tts/index-tts 8 Feb 2025

Recently, large language model (LLM) based text-to-speech (TTS) systems have gradually become the mainstream in the industry due to their high naturalness and powerful zero-shot voice cloning capabilities. Here, we introduce the IndexTTS system, which is mainly based on the XTTS and Tortoise models.

Decoder Language Modeling +6

2,900
0.50 stars / hour