1 code implementation • 26 Nov 2024 • Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
In this work, we develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection to reduce computational costs by formulating screenshots as an UI connected graph, adaptively identifying their redundant relationship and serve as the criteria for token selection during self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming that flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets by careful data curation and employing a resampling strategy to address significant data type imbalances.
no code implementations • 25 Nov 2024 • Zechen Bai, Jianxiong Gao, Ziteng Gao, Pichao Wang, Zheng Zhang, Tong He, Mike Zheng Shou
Visual tokenizers are fundamental to image generation.
1 code implementation • 29 Sep 2024 • Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou
We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos.
1 code implementation • 22 Aug 2024 • Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, WeiHao Wang, Kevin Qinghong Lin, YuChao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou
We present a unified transformer, i. e., Show-o, that unifies multimodal understanding and generation.
no code implementations • 14 Aug 2024 • Zechen Bai, Tianjun Xiao, Tong He, Pichao Wang, Zheng Zhang, Thomas Brox, Mike Zheng Shou
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video, enhancing the effectiveness of text-video retrieval systems.
1 code implementation • CVPR 2024 • Ke Fan, Zechen Bai, Tianjun Xiao, Tong He, Max Horn, Yanwei Fu, Francesco Locatello, Zheng Zhang
Moreover, our analysis substantiates that our method exhibits the capability to dynamically adapt the slot number according to each instance's complexity, offering the potential for further exploration in slot attention research.
1 code implementation • 23 May 2024 • Henry Hengyuan Zhao, Pan Zhou, Difei Gao, Zechen Bai, Mike Zheng Shou
Our results demonstrate consistent performance gains, underscoring the critical role of these additional tasks in fostering comprehensive intelligence in MLLMs.
Ranked #125 on Visual Question Answering on MM-Vet
1 code implementation • 29 Apr 2024 • Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, Mike Zheng Shou
By drawing the granular classification and landscapes of hallucination causes, evaluation benchmarks, and mitigation methods, this survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field.
1 code implementation • 21 Feb 2024 • Zechen Bai, Peng Chen, Xiaolan Peng, Lu Liu, Hui Chen, Mike Zheng Shou, Feng Tian
In our solution, a deep learning model was first trained to retarget the facial expression from input face images to virtual human faces by estimating the blendshape coefficients.
2 code implementations • 2 Feb 2024 • Zongbo Han, Zechen Bai, Haiyang Mei, Qianli Xu, Changqing Zhang, Mike Zheng Shou
Recent advancements in large vision-language models (LVLMs) have demonstrated impressive capability in visual information understanding with human language.
no code implementations • CVPR 2024 • Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou
Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks thereby boosting human productivity.
no code implementations • 20 Dec 2023 • Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou
Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity.
1 code implementation • ICCV 2023 • Ke Fan, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele, Thomas Brox, Zheng Zhang, Yanwei Fu, Tong He
In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization.
1 code implementation • ICCV 2023 • Zixu Zhao, Jiaze Wang, Max Horn, Yizhuo Ding, Tong He, Zechen Bai, Dominik Zietlow, Carl-Johann Simon-Gabriel, Bing Shuai, Zhuowen Tu, Thomas Brox, Bernt Schiele, Yanwei Fu, Francesco Locatello, Zheng Zhang, Tianjun Xiao
Unsupervised object-centric learning methods allow the partitioning of scenes into entities without additional localization information and are excellent candidates for reducing the annotation burden of multiple-object tracking (MOT) pipelines.
1 code implementation • ICCV 2021 • Zechen Bai, Yuta Nakashima, Noa Garcia
Have you ever looked at a painting and wondered what is the story behind it?
1 code implementation • CVPR 2021 • Zechen Bai, Zhigang Wang, Jian Wang, Di Hu, Errui Ding
Although achieving great success, most of them only use limited data from a single-source domain for model pre-training, making the rich labeled data insufficiently exploited.
no code implementations • 15 Jan 2020 • Li Wang, Zechen Bai, Yonghua Zhang, Hongtao Lu
SG and RWS are de-signed for the best use of recalled words.