Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage.
We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering.
We present a foundation model for zero-shot metric monocular depth estimation.
Motivated by this paradigm shift, we introduce DIAMOND (DIffusion As a Model Of eNvironment Dreams), a reinforcement learning agent trained in a diffusion world model.
Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to user needs.
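The retrieve-then-generate pattern described above can be sketched minimally as follows. This is an illustrative toy, not any system's actual implementation: the keyword-overlap ranker stands in for a real vector index, and all names (`retrieve`, `build_prompt`, the sample corpus) are hypothetical.

```python
import re

def tokens(text):
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, corpus, k=2):
    """Rank passages by word overlap with the query (stand-in for embedding search)."""
    q = tokens(query)
    return sorted(corpus, key=lambda p: len(q & tokens(p)), reverse=True)[:k]

def build_prompt(query, passages):
    """Prepend retrieved context so the LLM can ground its answer."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "The Eiffel Tower is in Paris.",
    "Python was created by Guido van Rossum.",
    "RAG combines retrieval with generation.",
]
print(build_prompt("Who created Python?", retrieve("Who created Python?", corpus, k=1)))
```

In a production RAG system the overlap score would be replaced by dense-vector similarity and the assembled prompt would be sent to an LLM; the control flow, however, is exactly this two-step retrieve-and-augment loop.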
Photo-realistic image restoration algorithms are typically evaluated by distortion measures (e.g., PSNR, SSIM) and by perceptual quality measures (e.g., FID, NIQE), where the desire is to attain the lowest possible distortion without compromising on perceptual quality.
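To make the distortion side concrete, PSNR is just a log-scaled inverse of the mean squared error between a restored image and its reference. The sketch below (plain NumPy, illustrative only) shows the standard formula; the toy 8x8 images are invented for the example.

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB: higher means lower distortion."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.full((8, 8), 100, dtype=np.uint8)   # flat reference image
noisy = ref.copy()
noisy[0, 0] = 110                            # one corrupted pixel -> MSE = 100/64
print(round(psnr(ref, noisy), 2))            # -> 46.19
```

Perceptual measures such as FID and NIQE work differently: they compare feature-space or natural-image statistics rather than per-pixel error, which is why the two families of metrics can disagree and why the trade-off mentioned above exists.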
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automating complex, multi-step tasks.
Previous robot learning methods often collect data and train with a single embodiment on a single task, which is expensive and prone to overfitting.
We have also contributed the first image composition toolbox, libcom (https://github.com/bcmi/libcom), which assembles 10+ image-composition functions (e.g., image blending, image harmonization, object placement, shadow generation, generative composition).