no code implementations • 18 May 2025 • Longxi Gao, Li Zhang, Mengwei Xu
Using this training task, we propose UI-shift, a framework for enhancing VLM-based GUI agents through self-supervised reinforcement learning (RL).
1 code implementation • 21 Mar 2025 • Li Zhang, Longxi Gao, Mengwei Xu
Reasoning capabilities have significantly improved the performance of vision-language models (VLMs) in domains such as mathematical problem-solving, coding, and visual question-answering.
1 code implementation • 12 Apr 2024 • Li Zhang, Shihe Wang, Xianqing Jia, Zhihan Zheng, Yunhe Yan, Longxi Gao, Yuanchun Li, Mengwei Xu
LlamaTouch comprises three key techniques: (1) On-device task execution, which enables mobile agents to interact with realistic mobile environments.