Unified multimodal understanding and generation has demonstrated impressive capabilities in cutting-edge proprietary systems.
Deploying deep learning models on mobile devices has drawn increasing attention in recent years.
Our evaluation on four different unstructured document analysis tasks demonstrates that DocETL finds plans with outputs that are 25 to 80% more accurate than well-engineered baselines, addressing a critical gap in unstructured data analysis.
We investigate the main contributors to OOD detection performance and find that reconstruction-based pretext tasks can provide a generally applicable and effective prior, helping the model learn the intrinsic data distribution of the in-distribution (ID) dataset.
To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters, (ii) Textual Coherence: language fluency, (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz: the poster's ability to convey core paper content as measured by VLMs answering generated quizzes.
Large Vision-Language Models (VLMs) have shown strong capabilities in multimodal understanding and reasoning, yet they are primarily constrained by text-based reasoning processes.
Class-Incremental Learning (CIL) requires models to continually acquire knowledge of new classes without forgetting old ones.
We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene.
We present Orbit, a unified and modular framework for robot learning powered by NVIDIA Isaac Sim.
The Gödel machine was proposed as a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner.