DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Zhangwenyao1/DreamVLA 7 Jul 2025

However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information.

Image Generation Multimodal Reasoning +3

75
0.26 stars / hour

PresentAgent: Multimodal Agent for Presentation Video Generation

AIGeeksGroup/PresentAgent 5 Jul 2025

We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos.

Text to Speech +1

52
0.26 stars / hour

Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts

lileyang1210/ant 17 Apr 2025

Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts.

Denoising

48
0.25 stars / hour

Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

aidc-ai/awesome-unified-multimodal-models 5 May 2025

Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation.

Survey Text to Image Generation +1

463
0.24 stars / hour

From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents

davidzwz/awesome-deep-research 23 Jun 2025

Supported by benchmark results and the rise of open-source implementations, we demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking.

Information Retrieval Retrieval

267
0.24 stars / hour

ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization

scu-zjz/forensichub 16 May 2025

The field of Fake Image Detection and Localization (FIDL) is highly fragmented, encompassing four domains: deepfake detection (Deepfake), image manipulation detection and localization (IMDL), artificial intelligence-generated image detection (AIGC), and document image manipulation localization (Doc).

All DeepFake Detection +5

90
0.24 stars / hour

Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library

alibaba/roll 6 Jun 2025

First, a single-controller architecture combined with an abstraction of the parallel worker simplifies the development of the training pipeline.

Management

1,469
0.23 stars / hour

Aligning Anime Video Generation with Human Feedback

bilibili/index-anisora 14 Apr 2025

Existing reward models, designed primarily for real-world videos, fail to capture the unique appearance and consistency requirements of anime.

Video Generation

1,813
0.22 stars / hour

Kinetics: Rethinking Test-Time Scaling Laws

infini-ai-lab/kinetics 5 Jun 2025

We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated.

67
0.22 stars / hour

Cradle: Empowering Foundation Agents Towards General Computer Control

baai-agents/cradle 5 Mar 2024

To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output.

Efficient Exploration

2,193
0.21 stars / hour