Emerging Properties in Unified Multimodal Pretraining

ByteDance-Seed/Bagel 20 May 2025

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems.

Image Editing +3

4,167
0.33 stars / hour

MNN: A Universal and Efficient Inference Engine

alibaba/MNN 27 Feb 2020

Deploying deep learning models on mobile devices has drawn increasing attention recently.

Deep Learning, Diversity +1

11,883
0.32 stars / hour

DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing

ucbepic/docetl 16 Oct 2024

Our evaluation on four different unstructured document analysis tasks demonstrates that DocETL finds plans with outputs that are 25 to 80% more accurate than well-engineered baselines, addressing a critical gap in unstructured data analysis.

2,259
0.31 stars / hour

Logits-Based Finetuning

dvlab-research/logits-based-finetuning 30 May 2025

We deeply explore the main contributors of OOD detection and find that reconstruction-based pretext tasks have the potential to provide a generally applicable and efficacious prior, which benefits the model in learning intrinsic data distributions of the ID dataset.

Out of Distribution (OOD) Detection

62
0.31 stars / hour

Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

paper2poster/paper2poster 27 May 2025

To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters, (ii) Textual Coherence: language fluency, (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz: the poster's ability to convey core paper content as measured by VLMs answering generated quizzes.

2,093
0.31 stars / hour

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

visual-agent/deepeyes 20 May 2025

Large Vision-Language Models (VLMs) have shown strong capabilities in multimodal understanding and reasoning, yet they are primarily constrained by text-based reasoning processes.

Hallucination, Mathematical Reasoning +4

492
0.30 stars / hour

MOS: Model Surgery for Pre-Trained Model-Based Class-Incremental Learning

sun-hailong/aaai25-mos 12 Dec 2024

Class-Incremental Learning (CIL) requires models to continually acquire knowledge of new classes without forgetting old ones.

Class-Incremental Learning +3

61
0.29 stars / hour

Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

fkryan/gazelle CVPR 2025

We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene.

Gaze Target Estimation

699
0.29 stars / hour

Orbit: A Unified Simulation Framework for Interactive Robot Learning Environments

NVIDIA-Omniverse/Orbit 10 Jan 2023

We present Orbit, a unified and modular framework for robot learning powered by NVIDIA Isaac Sim.

Imitation Learning, Motion Planning +5

3,923
0.28 stars / hour

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

jennyzzt/dgm 29 May 2025

The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner.

Meta-Learning

1,317
0.28 stars / hour