UI-TARS: Pioneering Automated GUI Interaction with Native Agents

bytedance/ui-tars 21 Jan 2025

This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations).

3,953 stars
0.29 stars / hour

TikZero: Zero-Shot Text-Guided Graphics Program Synthesis

potamides/detikzify 14 Mar 2025

Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available.

Program Synthesis

957 stars
0.29 stars / hour

MonSter: Marry Monodepth to Stereo Unleashes Power

junda24/monster 15 Jan 2025

The refined monodepth in turn guides stereo effectively in ill-posed regions.

Monocular Depth Estimation, Stereo Matching +1

262 stars
0.29 stars / hour

Auto-configuring Exploration-Exploitation Tradeoff in Evolutionary Computation via Deep Reinforcement Learning

GMC-DRL/MetaBox 12 Apr 2024

Evolutionary computation (EC) algorithms, renowned as powerful black-box optimizers, leverage a group of individuals to cooperatively search for the optimum.

Deep Reinforcement Learning

104 stars
0.28 stars / hour

LLM4Ranking: An Easy-to-use Framework of Utilizing Large Language Models for Document Reranking

liuqi6777/llm4ranking 10 Apr 2025

Using large language models (LLMs) for document reranking has been a popular and promising research direction in recent years, and many studies are dedicated to improving the performance and efficiency of LLM-based reranking.

38 stars
0.28 stars / hour

HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

dcdmllm/healthgpt 14 Feb 2025

To train HealthGPT effectively, we devise a comprehensive medical domain-specific comprehension and generation dataset called VL-Health.

Language Modeling +1

1,028 stars
0.28 stars / hour

MinerU: An Open-Source Solution for Precise Document Content Extraction

opendatalab/mineru 27 Sep 2024

Document content analysis has been a crucial research area in computer vision.

Diversity, Optical Character Recognition (OCR)

30,999 stars
0.28 stars / hour

TerraTorch: The Geospatial Foundation Models Toolkit

IBM/terratorch 26 Mar 2025

TerraTorch is a fine-tuning and benchmarking toolkit for Geospatial Foundation Models built on PyTorch Lightning and tailored for satellite, weather, and climate data.

Benchmarking, Decoder +2

402 stars
0.26 stars / hour

MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

aiming-lab/mdocagent 18 Mar 2025

These agents engage in multi-modal context retrieval, combining their individual insights to achieve a more comprehensive understanding of the document's content.

Document Understanding, Question Answering +2

111 stars
0.26 stars / hour

MedSAM2: Segment Anything in 3D Medical Images and Videos

bowang-lab/medsam2 4 Apr 2025

Medical image and video segmentation is a critical task for precision medicine, which has witnessed considerable progress in developing task- or modality-specific and generalist models for 2D images.

Segmentation, Video Segmentation +1

97 stars
0.25 stars / hour