Search Results for author: Shiwei Wu

Found 14 papers, 8 papers with code

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

1 code implementation • CVPR 2025 • Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational cost by formulating screenshots as a UI connected graph, adaptively identifying redundant relationships among patches and using them as the criteria for token selection in self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale, High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data-type imbalances.

Instruction Following · Natural Language Visual Grounding +1
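A minimal sketch of the idea behind innovation (i): link neighboring screenshot patches whose appearance is nearly identical into a "UI connected graph" and keep one representative token per connected component. This is an assumed illustration of the described mechanism, not the released ShowUI code; the function name `ui_connected_components` and the threshold `tau` are invented for the example.

```python
# Hedged sketch: group redundant screenshot patches via union-find
# over a grid, keeping one token per connected component.
import torch

def ui_connected_components(patch_colors: torch.Tensor, tau: float = 0.05):
    """patch_colors: (H, W, 3) mean RGB per patch, values in [0, 1]."""
    H, W, _ = patch_colors.shape
    parent = list(range(H * W))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for i in range(H):
        for j in range(W):
            idx = i * W + j
            for di, dj in ((0, 1), (1, 0)):  # right and down neighbors
                ni, nj = i + di, j + dj
                if ni < H and nj < W:
                    # Visually near-identical neighbors are treated as redundant
                    if torch.norm(patch_colors[i, j] - patch_colors[ni, nj]) < tau:
                        union(idx, ni * W + nj)

    roots = [find(k) for k in range(H * W)]
    return sorted(set(roots))  # indices of representative tokens to retain

# Usage: keep = ui_connected_components(colors); tokens = tokens[keep]
```

On real screenshots, large uniform regions such as backgrounds and panels collapse into single components, which is where the savings during self-attention would come from.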

Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents

no code implementations • 1 Oct 2024 • Shiwei Wu, Chen Zhang, Yan Gao, Qimeng Wang, Tong Xu, Yao Hu, Enhong Chen

Instructional documents are rich sources of knowledge for completing various tasks, yet their unique challenges in conversational question answering (CQA) have not been thoroughly explored.

Benchmarking · Conversational Question Answering

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

no code implementations • 29 Aug 2024 • Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou

Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
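The abstract suggests a mixture-of-depths-style router that decides which vision tokens receive full per-layer computation while the rest skip the layer. Below is a hedged sketch of that general mechanism, an assumption of mine rather than the VideoLLM-MoD implementation; `MoDVisionLayer` and `keep_ratio` are illustrative names.

```python
# Hedged sketch: route only the top-k vision tokens through the
# transformer block; skipped tokens pass through unchanged.
import torch
import torch.nn as nn

class MoDVisionLayer(nn.Module):
    def __init__(self, layer: nn.Module, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.layer = layer            # any (B, N, D) -> (B, N, D) block
        self.router = nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.router(x).squeeze(-1)        # (B, N) routing logits
        topk = scores.topk(k, dim=1).indices       # tokens worth full compute
        idx = topk.unsqueeze(-1).expand(-1, -1, D)
        selected = torch.gather(x, 1, idx)         # (B, k, D), k << N
        processed = self.layer(selected)           # heavy computation on k tokens
        out = x.clone()                            # identity path for the rest
        out.scatter_(1, idx, processed)
        return out
```

The design choice mirrors mixture-of-depths LLMs: compute is spent per token per layer only where the router predicts it matters, which is what makes long or streaming video tractable.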

Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization

1 code implementation • 6 Aug 2024 • Yanghai Zhang, Ye Liu, Shiwei Wu, Kai Zhang, Xukai Liu, Qi Liu, Enhong Chen

The rapid increase in multimedia data has spurred advancements in Multimodal Summarization with Multimodal Output (MSMO), which aims to produce a multimodal summary that integrates both text and relevant images.

Knowledge Distillation · Language Modeling +1

From a Social Cognitive Perspective: Context-aware Visual Social Relationship Recognition

no code implementations • 12 Jun 2024 • Shiwei Wu, Chao Zhang, Joya Chen, Tong Xu, Likang Wu, Yao Hu, Enhong Chen

People's social relationships are often manifested through their surroundings, with certain objects or interactions acting as symbols for specific relationships, e.g., wedding rings, roses, hugs, or holding hands.

Descriptive · Visual Social Relationship Recognition

NoteLLM-2: Multimodal Large Representation Models for Recommendation

1 code implementation • 27 May 2024 • Chao Zhang, Haoxin Zhang, Shiwei Wu, Di Wu, Tong Xu, Xiangyu Zhao, Yan Gao, Yao Hu, Enhong Chen

While leveraging existing Multimodal Large Language Models (MLLMs) for such tasks is promising, challenges arise from their delayed release relative to the corresponding LLMs and from their inefficiency on representation tasks.

In-Context Learning

NoteLLM: A Retrievable Large Language Model for Note Recommendation

no code implementations • 4 Mar 2024 • Chao Zhang, Shiwei Wu, Haoxin Zhang, Tong Xu, Yan Gao, Yao Hu, Di Wu, Enhong Chen

Indeed, learning to generate hashtags/categories can potentially enhance note embeddings, since both compress key note information into limited content.

Contrastive Learning · Language Modeling +2
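A rough sketch of how generation can be combined with embedding learning as described: a contrastive loss over in-batch pairs of related notes, plus a cross-entropy term for generating the note's hashtags. The function and argument names are illustrative assumptions, not NoteLLM's actual interface.

```python
# Hedged sketch: joint contrastive + hashtag-generation objective,
# so both embeddings and generated tags compress key note information.
import torch
import torch.nn.functional as F

def joint_note_loss(emb_a, emb_b, gen_logits, hashtag_ids,
                    temperature: float = 0.07, alpha: float = 1.0):
    """emb_a/emb_b: (B, D) embeddings of related note pairs;
    gen_logits: (B, T, V) decoder logits; hashtag_ids: (B, T) targets."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature                 # in-batch negatives
    targets = torch.arange(a.size(0), device=a.device)
    contrastive = F.cross_entropy(logits, targets)   # pull related notes together
    generation = F.cross_entropy(gen_logits.flatten(0, 1),
                                 hashtag_ids.flatten())
    return contrastive + alpha * generation
```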

Communication-Efficient Distributed Learning with Local Immediate Error Compensation

no code implementations • 19 Feb 2024 • Yifei Cheng, Li Shen, Linli Xu, Xun Qian, Shiwei Wu, Yiming Zhou, Tie Zhang, Dacheng Tao, Enhong Chen

However, existing compression methods either perform only unidirectional compression per iteration, at a higher communication cost, or bidirectional compression, at a slower convergence rate.
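For intuition, here is a minimal sketch of error-compensated gradient compression, the general mechanism this line of work builds on; it is my illustrative assumption, not the paper's exact algorithm, and `topk_compress` and the class name are invented.

```python
# Hedged sketch: each worker sends a top-k sparsified gradient and folds
# the compression error back into the next step instead of discarding it.
import torch

def topk_compress(g: torch.Tensor, ratio: float = 0.01) -> torch.Tensor:
    k = max(1, int(g.numel() * ratio))
    flat = g.flatten()
    idx = flat.abs().topk(k).indices
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.view_as(g)

class ErrorCompensatedWorker:
    def __init__(self, ratio: float = 0.01):
        self.ratio = ratio
        self.residual = None  # locally accumulated compression error

    def compress(self, grad: torch.Tensor) -> torch.Tensor:
        if self.residual is None:
            self.residual = torch.zeros_like(grad)
        corrected = grad + self.residual       # compensate before compressing
        msg = topk_compress(corrected, self.ratio)
        self.residual = corrected - msg        # keep what was dropped
        return msg                             # what actually goes on the wire
```

Compensating the error locally and immediately, rather than letting it accumulate across a round trip, is the kind of trade-off between communication cost and convergence rate that the abstract contrasts.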

Multi-Grained Multimodal Interaction Network for Entity Linking

1 code implementation • 19 Jul 2023 • Pengfei Luo, Tong Xu, Shiwei Wu, Chen Zhu, Linli Xu, Enhong Chen

Then, to derive the similarity matching score for each mention-entity pair, we devise three interaction units to comprehensively explore the intra-modal interactions and inter-modal fusion among the features of entities and mentions.

Contrastive Learning · Descriptive +2

A Solution to CVPR'2023 AQTC Challenge: Video Alignment for Multi-Step Inference

1 code implementation • 26 Jun 2023 • Chao Zhang, Shiwei Wu, Sirui Zhao, Tong Xu, Enhong Chen

In this paper, we present a solution for enhancing video alignment to improve multi-step inference.

Video Alignment

AU-aware graph convolutional network for Macro- and Micro-expression spotting

1 code implementation • 16 Mar 2023 • Shukang Yin, Shiwei Wu, Tong Xu, Shifeng Liu, Sirui Zhao, Enhong Chen

Automatic Micro-Expression (ME) spotting in long videos is a crucial step in ME analysis but also a challenging task due to the short duration and low intensity of MEs.

Micro-Expression Spotting

Winning the CVPR'2022 AQTC Challenge: A Two-stage Function-centric Approach

1 code implementation • 20 Jun 2022 • Shiwei Wu, Weidong He, Tong Xu, Hao Wang, Enhong Chen

Affordance-centric Question-driven Task Completion for Egocentric Assistant (AQTC) is a novel task that helps an AI assistant learn from instructional videos and scripts and guide the user step by step.

Is Heuristic Sampling Necessary in Training Deep Object Detectors?

15 code implementations • 11 Sep 2019 • Joya Chen, Dong Liu, Tong Xu, Shiwei Wu, Yifei Cheng, Enhong Chen

In this paper, we challenge the necessity of such hard/soft sampling methods for training accurate deep object detectors.

Diagnostic · General Classification +3
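As a sketch of the sampling-free direction the abstract points to, the classification loss below is computed over all anchors, with foreground-count normalization instead of selecting a subset via hard/soft sampling. This is an assumed illustration of the general idea, not the paper's published recipe.

```python
# Hedged sketch: train on every anchor; rebalance by normalization
# rather than by heuristic sampling of positives/negatives.
import torch
import torch.nn.functional as F

def sampling_free_cls_loss(logits: torch.Tensor, labels: torch.Tensor):
    """logits: (N,) objectness logits for all anchors; labels: (N,) in {0, 1}."""
    loss = F.binary_cross_entropy_with_logits(
        logits, labels.float(), reduction="none")
    num_fg = labels.sum().clamp(min=1)
    # Normalize by the foreground count so the many easy negatives do not
    # dominate the gradient, instead of discarding them via sampling.
    return loss.sum() / num_fg
```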
