1 code implementation • 26 Nov 2024 • Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI connected graph, adaptively identifying redundant relationships among patches, and using these as the criteria for token selection in self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation and the pairing of multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data-type imbalances.
Ranked #7 on Natural Language Visual Grounding on ScreenSpot
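The UI-guided token selection described in the abstract above lends itself to a compact illustration. The following is a toy sketch, not the authors' implementation: screenshot patches with near-identical mean color are linked into a "UI connected graph" via union-find, and each connected component contributes a single representative token. The patch size, color tolerance, and grouping rule are all illustrative assumptions.

```python
# Toy sketch of UI-guided visual token selection (hypothetical, not ShowUI's code).
import numpy as np

def select_ui_tokens(image: np.ndarray, patch: int = 16, tol: float = 4.0):
    """Return indices of patches kept after redundancy grouping.

    image: (H, W, 3) uint8 screenshot with H, W divisible by `patch`.
    tol:   max mean-color distance for two neighbors to count as redundant.
    """
    H, W, _ = image.shape
    gh, gw = H // patch, W // patch
    means = image.reshape(gh, patch, gw, patch, 3).mean(axis=(1, 3))  # (gh, gw, 3)

    parent = list(range(gh * gw))          # union-find over the patch grid
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for i in range(gh):
        for j in range(gw):
            for di, dj in ((0, 1), (1, 0)):          # right / down neighbors
                ni, nj = i + di, j + dj
                if ni < gh and nj < gw and \
                        np.linalg.norm(means[i, j] - means[ni, nj]) < tol:
                    union(i * gw + j, ni * gw + nj)  # visually redundant pair

    return sorted({find(k) for k in range(gh * gw)})  # one token per component

img = np.zeros((64, 64, 3), dtype=np.uint8)  # flat background ...
img[16:32, 16:48] = 200                      # ... with one "button"
print(len(select_ui_tokens(img)), "of", (64 // 16) ** 2, "tokens kept")  # 2 of 16
```

On this synthetic screenshot the entire background collapses to one token and the button to another, which is the kind of redundancy reduction the abstract alludes to.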
1 code implementation • 15 Nov 2024 • Siyuan Hu, Mingyu Ouyang, Difei Gao, Mike Zheng Shou
The recently released model, Claude 3.5 Computer Use, stands out as the first frontier AI model to offer computer use in public beta as a graphical user interface (GUI) agent.
1 code implementation • 31 Jul 2024 • Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou
In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts.
no code implementations • 19 Jun 2024 • Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou
The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks.
no code implementations • CVPR 2024 • Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou
Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content.
no code implementations • 14 Jun 2024 • Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks.
1 code implementation • 23 May 2024 • Henry Hengyuan Zhao, Pan Zhou, Difei Gao, Zechen Bai, Mike Zheng Shou
Our results demonstrate consistent performance gains, underscoring the critical role of these additional tasks in fostering comprehensive intelligence in MLLMs.
Ranked #154 on Visual Question Answering on MM-Vet
no code implementations • 24 Jan 2024 • Juan Hu, Xin Liao, Difei Gao, Satoshi Tsutsui, Qian Wang, Zheng Qin, Mike Zheng Shou
Deepfake videos are becoming increasingly realistic, showing few tampering traces on facial areas that vary between frames.
no code implementations • CVPR 2024 • Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou
Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity.
no code implementations • 20 Dec 2023 • Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou
Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity.
1 code implementation • CVPR 2024 • Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou
In this paper, we present ViT-Lens-2 that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space.
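To make the mechanism concrete, here is a hedged schematic (not the released ViT-Lens-2 code) of the idea the abstract describes: a lightweight modality-specific tokenizer feeds a frozen pretrained ViT, and a projection head aligns the output to a pre-defined anchor space such as CLIP's embedding space. Module names and dimensions are illustrative assumptions.

```python
# Schematic of perceiving a novel modality with a pretrained ViT (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NovelModalityEncoder(nn.Module):
    def __init__(self, in_dim: int, vit: nn.Module, vit_dim: int, anchor_dim: int):
        super().__init__()
        self.tokenizer = nn.Linear(in_dim, vit_dim)  # modality -> ViT-style tokens
        self.vit = vit                               # pretrained ViT, kept frozen
        for p in self.vit.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(vit_dim, anchor_dim)   # align to the anchor space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.vit(self.tokenizer(x))         # reuse the ViT's priors
        return F.normalize(self.proj(tokens.mean(dim=1)), dim=-1)

# Stand-in for a pretrained ViT trunk; a real setup would load actual weights.
vit = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2)
enc = NovelModalityEncoder(in_dim=32, vit=vit, vit_dim=768, anchor_dim=512)
emb = enc(torch.randn(4, 10, 32))  # e.g., 4 samples of a novel modality
print(emb.shape)                   # torch.Size([4, 512])
# Training would pull `emb` toward paired anchors (e.g., CLIP text features)
# with a contrastive loss.
```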
1 code implementation • 24 Oct 2023 • Jay Zhangjie Wu, Xiuyu Li, Difei Gao, Zhen Dong, Jinbin Bai, Aishani Singh, Xiaoyu Xiang, Youzeng Li, Zuwei Huang, Yuanxi Sun, Rui He, Feng Hu, Junhua Hu, Hai Huang, Hanyu Zhu, Xu Cheng, Jie Tang, Mike Zheng Shou, Kurt Keutzer, Forrest Iandola
In this paper we present a retrospective on the competition and describe the winning method.
1 code implementation • 27 Sep 2023 • David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, YuChao Gu, Difei Gao, Mike Zheng Shou
In this paper, we are the first to propose a hybrid model, dubbed as Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation.
Ranked #2 on Text-to-Video Generation on EvalCrafter Text-to-Video (ECTV) Dataset (using extra training data)
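As a rough mental model of such a hybrid, the stub pipeline below (a sketch under assumptions drawn from the abstract, not the released Show-1 code) generates a text-aligned low-resolution video with pixel-based stages and then super-resolves it with a cheaper latent-based stage; every stage function is an illustrative stub.

```python
# Stub pipeline illustrating a pixel/latent hybrid for text-to-video.
import numpy as np

def pixel_keyframes(prompt: str, n: int = 8, res: int = 64) -> np.ndarray:
    """Stub for a pixel-space VDM: low-res keyframes, strong text alignment."""
    return np.random.rand(n, res, res, 3)

def pixel_interpolate(frames: np.ndarray, factor: int = 2) -> np.ndarray:
    """Stub for a pixel-space temporal-interpolation VDM."""
    return np.repeat(frames, factor, axis=0)

def latent_superres(frames: np.ndarray, scale: int = 4) -> np.ndarray:
    """Stub for a latent-space VDM: cheap spatial super-resolution."""
    return frames.repeat(scale, axis=1).repeat(scale, axis=2)

def hybrid_text_to_video(prompt: str) -> np.ndarray:
    low_res = pixel_interpolate(pixel_keyframes(prompt))  # pixel-based stages
    return latent_superres(low_res)                       # latent-based stage

print(hybrid_text_to_video("a cat surfing a wave").shape)  # (16, 256, 256, 3)
```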
no code implementations • 19 Aug 2023 • Juan Hu, Xin Liao, Difei Gao, Satoshi Tsutsui, Qian Wang, Zheng Qin, Mike Zheng Shou
In the recovering stage, the model focuses on randomly masking regions of interest (ROIs) and reconstructing real faces without unpredictable tampered traces, resulting in a relatively good recovery effect for real faces but a poor one for fake faces.
1 code implementation • ICCV 2023 • Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou
Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their ability to generalize to various VTG tasks and labels.
Ranked #8 on Natural Language Moment Retrieval on TACoS
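One way to read "various VTG tasks and labels" is through a unified clip-level label: if every clip carries a foreground flag, boundary offsets, and a saliency score, then moment retrieval (time interval) and highlight detection (worthiness curve) can share one prediction head. The sketch below is a hedged illustration of that formulation; the field names are assumptions, not the paper's API.

```python
# Illustrative unified clip-level label for video temporal grounding.
from dataclasses import dataclass

@dataclass
class ClipLabel:
    foreground: bool     # does this clip lie inside the target moment?
    left_offset: float   # seconds from clip center to the interval start
    right_offset: float  # seconds from clip center to the interval end
    saliency: float      # query-conditioned worthiness of the clip

def interval_to_labels(start, end, clip_centers, saliency_curve):
    """Convert one (start, end) moment plus a saliency curve into
    per-clip labels usable by both retrieval and highlight heads."""
    return [ClipLabel(start <= c <= end, c - start, end - c, s)
            for c, s in zip(clip_centers, saliency_curve)]

labels = interval_to_labels(4.0, 9.0, clip_centers=[1.0, 5.0, 8.0, 12.0],
                            saliency_curve=[0.1, 0.8, 0.9, 0.2])
print(labels[1])
# ClipLabel(foreground=True, left_offset=1.0, right_offset=4.0, saliency=0.8)
```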
1 code implementation • 27 Jun 2023 • Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li, Wing-Kwong Chan, Chong-Wah Ngo, Nan Duan, Mike Zheng Shou
Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and further fine-tune the model on annotated data.
no code implementations • 14 Jun 2023 • Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, Mike Zheng Shou
2) Flexible inputs and intermediate results.
1 code implementation • CVPR 2023 • Joya Chen, Difei Gao, Kevin Qinghong Lin, Mike Zheng Shou
Humans excel at learning from expert demonstrations and solving their own problems.
Ranked #1 on Video-to-image Affordance Grounding on EPIC-Hotspot
no code implementations • 3 Mar 2023 • Juan Hu, Xin Liao, Difei Gao, Satoshi Tsutsui, Qian Wang, Zheng Qin, Mike Zheng Shou
Specifically, given a real face image, we first pretrain a masked autoencoder to learn facial part consistency by dividing faces into three parts and randomly masking ROIs, which are then recovered based on the unmasked facial parts.
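The recover-then-compare idea reads naturally as pseudo-code. The toy sketch below (not the paper's code) masks random facial ROIs, "recovers" them with a stand-in for the pretrained masked autoencoder, and scores fakeness by the recovery error; the ROI count, sizes, and the placeholder reconstructor are all assumptions.

```python
# Toy recover-then-compare scoring for deepfake detection (illustrative).
import numpy as np

rng = np.random.default_rng(0)

def mask_rois(face: np.ndarray, n_rois: int = 3, size: int = 16):
    """Zero out `n_rois` random square ROIs; return the masked face and mask."""
    masked, mask = face.copy(), np.zeros(face.shape[:2], dtype=bool)
    H, W = face.shape[:2]
    for _ in range(n_rois):
        y, x = rng.integers(0, H - size), rng.integers(0, W - size)
        masked[y:y + size, x:x + size] = 0
        mask[y:y + size, x:x + size] = True
    return masked, mask

def reconstruct(masked: np.ndarray) -> np.ndarray:
    """Placeholder for the pretrained masked autoencoder (here: mean fill)."""
    return np.full_like(masked, masked[masked > 0].mean())

def fakeness_score(face: np.ndarray) -> float:
    """Mean absolute recovery error inside the masked ROIs. Real faces
    should recover well (low score); tampered ones should not."""
    masked, mask = mask_rois(face)
    return float(np.abs(reconstruct(masked)[mask] - face[mask]).mean())

print(round(fakeness_score(rng.random((64, 64, 3))), 3))
```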
1 code implementation • CVPR 2023 • Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou
To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must.
Ranked #2 on Video Question Answering on AGQA 2.0 balanced
no code implementations • 16 Nov 2022 • Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan
This technical report describes the CONE approach for Ego4D Natural Language Queries (NLQ) Challenge in ECCV 2022.
1 code implementation • 22 Sep 2022 • Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan
This paper tackles an emerging and challenging problem of long video temporal grounding (VTG) that localizes video moments related to a natural language (NL) query.
1 code implementation • 24 Aug 2022 • Stan Weixian Lei, Difei Gao, Jay Zhangjie Wu, Yuxuan Wang, Wei Liu, Mengmi Zhang, Mike Zheng Shou
However, CL on VQA involves more than the expansion of label sets (new Answer sets).
1 code implementation • 4 Jul 2022 • Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, RongCheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou
In this report, we propose a video-language pretraining (VLP) based solution, EgoVLP, for four Ego4D challenge tasks, including Natural Language Query (NLQ), Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR).
2 code implementations • 3 Jun 2022 • Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, RongCheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou
Video-Language Pretraining (VLP), which aims to learn transferable representation to advance a wide range of video-text downstream tasks, has recently received increasing attention.
1 code implementation • 1 Apr 2022 • Yuxuan Wang, Difei Gao, Licheng Yu, Stan Weixian Lei, Matt Feiszli, Mike Zheng Shou
In this paper, we introduce a new dataset called Kinetic-GEB+.
Ranked #1 on Boundary Captioning on Kinetics-GEB+
4 code implementations • 8 Mar 2022 • Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, Mike Zheng Shou
In this paper, we define a new task called Affordance-centric Question-driven Task Completion, where the AI assistant should learn from instructional videos to provide step-by-step help in the user's view.
2 code implementations • 30 Nov 2021 • Stan Weixian Lei, Difei Gao, Yuxuan Wang, Dongxing Mao, Zihan Liang, Lingmin Ran, Mike Zheng Shou
In contrast, we present a new task called Task-oriented Question-driven Video Segment Retrieval (TQVSR).
no code implementations • ICCV 2021 • Difei Gao, Ruiping Wang, Ziyi Bai, Xilin Chen
Visual understanding goes well beyond the study of images or videos on the web.
1 code implementation • CVPR 2020 • Difei Gao, Ke Li, Ruiping Wang, Shiguang Shan, Xilin Chen
Then, we introduce three aggregators that guide message passing from one graph to another, exploiting the contexts in various modalities so as to refine the node features.
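As a minimal, hypothetical sketch of what such an aggregator can look like (not the paper's implementation), the module below lets the nodes of one graph attend over the nodes of another and folds the attended context back into their features; the single-head attention and the concatenation-based update are assumptions.

```python
# Hypothetical cross-graph aggregator: refine graph A's nodes with graph B.
import torch
import torch.nn as nn

class CrossGraphAggregator(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, nodes_a: torch.Tensor, nodes_b: torch.Tensor):
        # nodes_a: (Na, d) target graph; nodes_b: (Nb, d) context graph.
        attn = torch.softmax(
            self.q(nodes_a) @ self.k(nodes_b).T / nodes_a.shape[-1] ** 0.5,
            dim=-1)                          # (Na, Nb) cross-graph weights
        context = attn @ self.v(nodes_b)     # per-node message from graph B
        return self.update(torch.cat([nodes_a, context], dim=-1))

agg = CrossGraphAggregator(64)
print(agg(torch.randn(5, 64), torch.randn(7, 64)).shape)  # torch.Size([5, 64])
```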
no code implementations • 8 Aug 2019 • Difei Gao, Ruiping Wang, Shiguang Shan, Xilin Chen
To comprehensively evaluate such abilities, we propose a VQA benchmark, CRIC, which introduces new types of questions about Compositional Reasoning on vIsion and Commonsense, and an evaluation metric integrating the correctness of answering and commonsense grounding.
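A joint metric of this kind can be stated in a few lines. The sketch below is a hypothetical rendering, not CRIC's official scorer: a prediction earns credit only if the answer is correct and the cited commonsense evidence overlaps the annotated grounding.

```python
# Hypothetical joint answer+grounding scorer (not the official CRIC metric).
def joint_score(pred_answer, gold_answer, pred_evidence, gold_evidence):
    correct = pred_answer == gold_answer
    grounded = bool(set(pred_evidence) & set(gold_evidence))
    return 1.0 if correct and grounded else 0.0

print(joint_score("umbrella", "umbrella",
                  {"umbrellas block rain"}, {"umbrellas block rain"}))  # 1.0
```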