1 code implementation • CVPR 2025 • Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
In this work, we develop a vision-language-action model for the digital world, named ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI connected graph, adaptively identifying redundant relationships among patches and using them as the criterion for token selection in self-attention blocks (a minimal sketch of (i) appears below); (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or the pairing of multi-turn query-action sequences per screenshot to improve training efficiency; (iii) small-scale, high-quality GUI instruction-following datasets built through careful data curation and a resampling strategy that addresses significant data-type imbalances.
Ranked #7 on Natural Language Visual Grounding on ScreenSpot
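As a concrete picture of innovation (i), the following is a minimal sketch that groups adjacent screenshot patches with near-identical color into components and keeps one token per component. The union-find construction and the `thresh` parameter are illustrative assumptions, not ShowUI's exact algorithm:

```python
import torch

def build_ui_graph(patch_colors: torch.Tensor, thresh: float = 0.02) -> torch.Tensor:
    """Group adjacent patches with near-identical mean color into components.

    patch_colors: (H, W, 3) mean RGB per patch in [0, 1].
    Returns an (H*W,) tensor of component ids (a 'UI connected graph').
    """
    H, W, _ = patch_colors.shape
    parent = list(range(H * W))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for i in range(H):
        for j in range(W):
            idx = i * W + j
            # Link right/bottom neighbors whose colors barely differ;
            # large flat UI regions collapse into single components.
            if j + 1 < W and (patch_colors[i, j] - patch_colors[i, j + 1]).abs().max() < thresh:
                union(idx, idx + 1)
            if i + 1 < H and (patch_colors[i, j] - patch_colors[i + 1, j]).abs().max() < thresh:
                union(idx, idx + W)
    return torch.tensor([find(k) for k in range(H * W)])

def select_tokens(tokens: torch.Tensor, components: torch.Tensor) -> torch.Tensor:
    """Keep one representative token per redundant component (tokens: (H*W, D))."""
    keep, seen = [], set()
    for k, c in enumerate(components.tolist()):
        if c not in seen:
            seen.add(c)
            keep.append(k)
    return tokens[keep]
```

In a real model, the surviving indices would determine which tokens participate in the self-attention blocks, matching the selection criterion described above.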
no code implementations • 1 Oct 2024 • Shiwei Wu, Chen Zhang, Yan Gao, Qimeng Wang, Tong Xu, Yao Hu, Enhong Chen
Instructional documents are rich sources of knowledge for completing various tasks, yet their unique challenges in conversational question answering (CQA) have not been thoroughly explored.
no code implementations • 29 Aug 2024 • Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
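To make the mixture-of-depths idea concrete, here is a hedged sketch in which a learned router keeps only the top-k vision tokens for full computation at a given layer, while the rest skip it via the residual path. The class name, the stand-in FFN block, and the `keep_ratio` default are illustrative assumptions, not VideoLLM-MoD's actual architecture:

```python
import torch
import torch.nn as nn

class MoDVisionLayer(nn.Module):
    """Process only the top-k vision tokens; the rest pass through unchanged."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.router = nn.Linear(dim, 1)  # learned per-token importance score
        self.block = nn.Sequential(      # stand-in for the layer's attention/FFN compute
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )
        self.keep_ratio = keep_ratio

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        B, N, D = vision_tokens.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.router(vision_tokens).squeeze(-1)   # (B, N)
        top = scores.topk(k, dim=1).indices               # (B, k)
        out = vision_tokens.clone()                       # skipped tokens pass through
        for b in range(B):
            sel = vision_tokens[b, top[b]]                # (k, D)
            weight = torch.sigmoid(scores[b, top[b]]).unsqueeze(-1)
            # Residual update scaled by the router score keeps routing differentiable.
            out[b, top[b]] = sel + weight * self.block(sel)
        return out
```

With `keep_ratio=0.25`, roughly three quarters of the vision tokens skip the layer's computation, which is how this family of methods cuts cost on long or streaming video.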
1 code implementation • 6 Aug 2024 • Yanghai Zhang, Ye Liu, Shiwei Wu, Kai Zhang, Xukai Liu, Qi Liu, Enhong Chen
The rapid increase in multimedia data has spurred advancements in Multimodal Summarization with Multimodal Output (MSMO), which aims to produce a multimodal summary that integrates both text and relevant images.
no code implementations • CVPR 2024 • Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou
Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content.
no code implementations • 12 Jun 2024 • Shiwei Wu, Chao Zhang, Joya Chen, Tong Xu, Likang Wu, Yao Hu, Enhong Chen
People's social relationships are often manifested through their surroundings, with certain objects or interactions acting as symbols for specific relationships, e.g., wedding rings, roses, hugs, or holding hands.
1 code implementation • 27 May 2024 • Chao Zhang, Haoxin Zhang, Shiwei Wu, Di Wu, Tong Xu, Xiangyu Zhao, Yan Gao, Yao Hu, Enhong Chen
While leveraging existing Multimodal Large Language Models (MLLMs) for such tasks is promising, challenges arise due to their delayed release compared to corresponding LLMs and the inefficiency in representation tasks.
no code implementations • 4 Mar 2024 • Chao Zhang, Shiwei Wu, Haoxin Zhang, Tong Xu, Yan Gao, Yao Hu, Di Wu, Enhong Chen
Indeed, learning to generate hashtags/categories can potentially enhance note embeddings, since both compress key note information into limited content.
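A minimal sketch of that intuition, assuming an encoder-decoder split and a simple weighted sum of losses (both assumptions for illustration, not the paper's design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoteModel(nn.Module):
    """Jointly learn a note embedding and hashtag/category generation."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module, alpha: float = 0.5):
        super().__init__()
        self.encoder = encoder   # note content -> embedding
        self.decoder = decoder   # embedding -> hashtag token logits (B, T, vocab)
        self.alpha = alpha       # weight on the auxiliary generation loss

    def training_loss(self, notes, hashtags, embed_loss_fn):
        emb = self.encoder(notes)
        gen_logits = self.decoder(emb)
        gen_loss = F.cross_entropy(gen_logits.flatten(0, 1), hashtags.flatten())
        # Generating hashtags forces the embedding to compress key note
        # information, which is the effect the abstract describes.
        return embed_loss_fn(emb) + self.alpha * gen_loss
```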
no code implementations • 19 Feb 2024 • Yifei Cheng, Li Shen, Linli Xu, Xun Qian, Shiwei Wu, Yiming Zhou, Tie Zhang, DaCheng Tao, Enhong Chen
However, existing compression methods either perform only unidirectional compression per iteration, incurring a higher communication cost, or perform bidirectional compression at the price of a slower convergence rate.
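For readers unfamiliar with the contrast, here is a minimal sketch of bidirectional top-k compression with error feedback for a single worker/server pair; the `ratio` default and the function names are illustrative, not the paper's algorithm:

```python
import torch

def topk_compress(x: torch.Tensor, ratio: float = 0.01) -> torch.Tensor:
    """Keep only the largest-magnitude entries of x (the rest become zero)."""
    k = max(1, int(x.numel() * ratio))
    flat = x.flatten()
    idx = flat.abs().topk(k).indices
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(x)

def bidirectional_step(grad, worker_err, server_err, ratio=0.01):
    # Uplink (worker -> server): compress the gradient plus accumulated error.
    up = topk_compress(grad + worker_err, ratio)
    worker_err = grad + worker_err - up
    # Downlink (server -> worker): compress the aggregate plus its own error.
    down = topk_compress(up + server_err, ratio)
    server_err = up + server_err - down
    return down, worker_err, server_err
```

Compressing both directions cuts communication on the downlink as well, but the double truncation is exactly what tends to slow convergence, motivating the paper's improved scheme.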
1 code implementation • 19 Jul 2023 • Pengfei Luo, Tong Xu, Shiwei Wu, Chen Zhu, Linli Xu, Enhong Chen
Then, to derive a similarity matching score for each mention-entity pair, we devise three interaction units to comprehensively explore intra-modal interaction and inter-modal fusion among the features of entities and mentions.
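Since the abstract does not spell out the three units, the sketch below uses illustrative stand-ins: two intra-modal cosine similarities plus one similarity over fused cross-modal features. The gated-sum fusion is likewise an assumption:

```python
import torch
import torch.nn.functional as F

def match_score(m_txt, m_img, e_txt, e_img, fuse) -> torch.Tensor:
    """Score a mention (m_*) against an entity (e_*); all inputs are (D,) features."""
    # Unit 1: intra-modal text-to-text similarity.
    s_txt = F.cosine_similarity(m_txt, e_txt, dim=-1)
    # Unit 2: intra-modal image-to-image similarity.
    s_img = F.cosine_similarity(m_img, e_img, dim=-1)
    # Unit 3: inter-modal fusion, comparing jointly fused representations.
    s_fuse = F.cosine_similarity(fuse(m_txt, m_img), fuse(e_txt, e_img), dim=-1)
    return (s_txt + s_img + s_fuse) / 3

def make_fuse(dim: int):
    """One possible fusion: a learned gated sum of the two modalities."""
    gate = torch.nn.Linear(2 * dim, dim)
    def fuse(t: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(gate(torch.cat([t, v], dim=-1)))
        return g * t + (1 - g) * v
    return fuse
```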
1 code implementation • 26 Jun 2023 • Chao Zhang, Shiwei Wu, Sirui Zhao, Tong Xu, Enhong Chen
In this paper, we present a solution for enhancing video alignment to improve multi-step inference.
1 code implementation • 16 Mar 2023 • Shukang Yin, Shiwei Wu, Tong Xu, Shifeng Liu, Sirui Zhao, Enhong Chen
Automatic Micro-Expression (ME) spotting in long videos is a crucial step in ME analysis but also a challenging task due to the short duration and low intensity of MEs.
1 code implementation • 20 Jun 2022 • Shiwei Wu, Weidong He, Tong Xu, Hao Wang, Enhong Chen
Affordance-centric Question-driven Task Completion for Egocentric Assistant (AQTC) is a novel task that helps an AI assistant learn from instructional videos and scripts and guide the user step by step.
15 code implementations • 11 Sep 2019 • Joya Chen, Dong Liu, Tong Xu, Shiwei Wu, Yifei Cheng, Enhong Chen
In this paper, we challenge the necessity of such hard/soft sampling methods for training accurate deep object detectors.
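As background for the setting being challenged, here is a minimal sketch of a classification head trained over all anchors without positive/negative sampling, using the widely known prior-probability bias initialization to keep the background-dominated loss stable early in training; this illustrates the problem setup, not necessarily the paper's full mechanism:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClsHead(nn.Module):
    def __init__(self, in_ch: int, num_anchors: int, num_classes: int,
                 prior: float = 0.01):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, num_anchors * num_classes, 3, padding=1)
        # Initialize the bias so sigmoid outputs ~prior at the start,
        # avoiding the huge background loss that sampling heuristics
        # are usually introduced to mask.
        nn.init.constant_(self.cls.bias, -math.log((1 - prior) / prior))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.cls(feat)

def cls_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Plain BCE over all anchors; no hard/soft example sampling applied.
    return F.binary_cross_entropy_with_logits(logits, targets, reduction="mean")
```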