1 code implementation • 27 May 2025 • Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr
To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters; (ii) Textual Coherence: language fluency; (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge; and notably (iv) PaperQuiz: the poster's ability to convey core paper content, as measured by VLMs answering generated quizzes.
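The PaperQuiz metric is only described at a high level here; below is a minimal sketch of how such a quiz-based score could be computed, assuming a hypothetical `vlm_answer` helper that queries any vision-language model with a poster image and a formatted multiple-choice question. The function names, quiz format, and scoring rule are illustrative assumptions, not the benchmark's actual protocol.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QuizItem:
    question: str        # question generated from the source paper
    choices: List[str]   # multiple-choice options, e.g. "A) ...", "B) ..."
    answer: str          # correct option label, e.g. "B"


def paper_quiz_score(
    poster_image,                                   # e.g. a PIL image of the poster
    quiz: List[QuizItem],
    vlm_answer: Callable[[object, str], str],       # hypothetical VLM wrapper
) -> float:
    """Fraction of quiz questions a VLM answers correctly from the poster alone."""
    correct = 0
    for item in quiz:
        prompt = item.question + "\n" + "\n".join(item.choices)
        prediction = vlm_answer(poster_image, prompt).strip().upper()
        if prediction.startswith(item.answer.upper()):
            correct += 1
    return correct / max(len(quiz), 1)
```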
1 code implementation • 22 May 2025 • Jiaqi Wang, Kevin Qinghong Lin, James Cheng, Mike Zheng Shou
This introduces a think-or-not format that serves as a cold start for selective reasoning, followed by (ii) a GRPO stage that lets the model freely explore when to think or not while maximizing task-aware outcome rewards.
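To make the "task-aware outcome reward" concrete, here is one plausible shape of such a reward, a minimal sketch assuming the response may contain an optional `<think>` block followed by an `<answer>` block. The tag names and the all-or-nothing scoring are assumptions for illustration, not the paper's exact reward design; the key property is that skipping the think block is never penalized, so the policy can learn when thinking is unnecessary.

```python
import re


def outcome_reward(response: str, ground_truth: str) -> float:
    """Illustrative outcome reward for selective reasoning.

    Only the content of the <answer> block is scored, so responses that skip
    the optional <think> block receive exactly the same reward when the final
    answer is correct.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0  # malformed output gets no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```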
no code implementations • 19 Mar 2025 • Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, Sai Rajeswar
We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments.
1 code implementation • 17 Mar 2025 • Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou
Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence.
Ranked #2 on Zero-Shot Video Question Answer on NExT-GQA
1 code implementation • CVPR 2025 • Kevin Qinghong Lin, Mike Zheng Shou
Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary.
no code implementations • CVPR 2025 • YuChao Gu, Yipin Zhou, Yunfan Ye, Yixin Nie, Licheng Yu, Pingchuan Ma, Kevin Qinghong Lin, Mike Zheng Shou
Natural language often struggles to accurately associate positional and attribute information with multiple instances, which limits current text-based visual generation models to simpler compositions featuring only a few dominant instances.
1 code implementation • CVPR 2025 • Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI-connected graph and adaptively identifying redundant relationships that serve as the criterion for token selection in self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to improve training efficiency; (iii) Small-scale, High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data-type imbalances.
Ranked #7 on Natural Language Visual Grounding on ScreenSpot
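The UI-Guided Visual Token Selection described in the ShowUI entry above lends itself to a toy illustration: group visually redundant screenshot patches via a connected graph and keep one representative per component. The sketch below uses a union-find over a patch grid; the patch size, colour tolerance, and function name are illustrative assumptions and not the released implementation.

```python
import numpy as np


def select_ui_tokens(screenshot: np.ndarray, patch: int = 16, tol: float = 2.0):
    """Toy version of UI-guided token selection.

    Splits an (H, W, 3) screenshot into a patch grid, links neighbouring
    patches whose mean colours are nearly identical (a crude stand-in for
    redundant UI regions such as blank backgrounds), and keeps one
    representative patch index per connected component.
    """
    h, w, _ = screenshot.shape
    gh, gw = h // patch, w // patch
    means = screenshot[: gh * patch, : gw * patch].reshape(
        gh, patch, gw, patch, 3).mean(axis=(1, 3))            # (gh, gw, 3)

    parent = list(range(gh * gw))                              # union-find

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for i in range(gh):
        for j in range(gw):
            for di, dj in ((1, 0), (0, 1)):                    # 4-neighbourhood
                ni, nj = i + di, j + dj
                if ni < gh and nj < gw and np.abs(
                        means[i, j] - means[ni, nj]).max() < tol:
                    union(i * gw + j, ni * gw + nj)

    # one kept token per connected component of near-identical patches
    return sorted({find(k) for k in range(gh * gw)})
```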
1 code implementation • CVPR 2025 • Weijia Wu, MingYu Liu, Zeyu Zhu, Xi Xia, Haoen Feng, Wen Wang, Kevin Qinghong Lin, Chunhua Shen, Mike Zheng Shou
Recent advancements in video generation models, like Stable Video Diffusion, show promising results, but primarily focus on short, single-scene videos.
no code implementations • 29 Aug 2024 • Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
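The mixture-of-depths idea referenced above can be illustrated with a short PyTorch-style sketch: a tiny router scores vision tokens, only the top fraction passes through a layer's expensive computation, and the rest bypass it unchanged. The class name, router design, keep ratio, and layer interface are assumptions for illustration, not the released VideoLLM-MoD code.

```python
import torch
import torch.nn as nn


class MoDVisionSkip(nn.Module):
    """Route only the top-k vision tokens through an expensive layer."""

    def __init__(self, layer: nn.Module, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.layer = layer            # assumed shape-preserving (e.g. a transformer block)
        self.router = nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_tokens, dim)
        b, n, d = vision_tokens.shape
        k = max(1, int(n * self.keep_ratio))
        scores = self.router(vision_tokens).squeeze(-1)           # (b, n)
        topk = scores.topk(k, dim=1).indices                      # (b, k)
        index = topk.unsqueeze(-1).expand(-1, -1, d)
        selected = torch.gather(vision_tokens, 1, index)          # (b, k, d)
        processed = self.layer(selected)                          # heavy compute on k tokens only
        out = vision_tokens.clone()                               # skipped tokens pass through
        out.scatter_(1, index, processed)
        return out
```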
1 code implementation • 22 Aug 2024 • Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, WeiHao Wang, Kevin Qinghong Lin, YuChao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation.
Ranked #12 on Image Generation on WISE
1 code implementation • 31 Jul 2024 • Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou
In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts.
no code implementations • 19 Jun 2024 • Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou
The advent of Multimodal LLMs has significantly enhanced image OCR capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks.
no code implementations • CVPR 2024 • Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou
Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content.
no code implementations • 14 Jun 2024 • Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks.
1 code implementation • 24 Apr 2024 • Jinheng Xie, Jiajun Feng, Zhaoxu Tian, Kevin Qinghong Lin, Yawen Huang, Xi Xia, Nanxu Gong, Xu Zuo, Jiaqi Yang, Yefeng Zheng, Mike Zheng Shou
Instead of operating on pixel space, it is efficient to employ visual locations like bounding boxes and keypoints to represent key information in videos, which can be simply discretized and then tokenized for consumption by GPT.
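The discretize-then-tokenize step described above is simple to illustrate: normalize box coordinates, quantize them into bins, and render each bin as a special location token. The function name, bin count, and token format below are illustrative assumptions, not the paper's exact vocabulary.

```python
def box_to_tokens(box, image_w, image_h, num_bins=1000):
    """Discretize a bounding box into location tokens a GPT-style model can consume."""
    x1, y1, x2, y2 = box
    norm = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    bins = [min(num_bins - 1, int(v * num_bins)) for v in norm]
    return [f"<loc_{b}>" for b in bins]


# Example: a box near the top-left of a 640x480 frame
print(box_to_tokens((10, 20, 60, 100), 640, 480))
# ['<loc_15>', '<loc_41>', '<loc_93>', '<loc_208>']
```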
no code implementations • 1 Jan 2024 • Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, JianFeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
Our unified framework merges unimodal and multimodal elements, enhancing model performance for tasks involving textual and visual data while notably reducing the number of learnable parameters.
1 code implementation • CVPR 2024 • Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou
In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way.
no code implementations • 29 Aug 2023 • Henghao Zhao, Kevin Qinghong Lin, Rui Yan, Zechao Li
DiffusionVMR can be run with arbitrary settings at inference time, without requiring consistency with the training phase.
Ranked #5 on Video Grounding on QVHighlights
1 code implementation • ICCV 2023 • Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou
Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their ability to generalize to various VTG tasks and labels.
Ranked #8 on Highlight Detection on QVHighlights (using extra training data)
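One way to see how interval labels (moment retrieval) and curve labels (highlight detection) can share a single supervision format, in the spirit of the unified grounding described above, is a per-clip label carrying a foreground flag, boundary offsets, and a saliency score. The field names and exact decomposition below are illustrative assumptions, not the paper's definitive formulation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClipLabel:
    foreground: bool     # does this clip fall inside the target moment?
    start_offset: float  # seconds from this clip back to the moment start
    end_offset: float    # seconds from this clip forward to the moment end
    saliency: float      # highlight worthiness of this clip in [0, 1]


def interval_to_labels(start: float, end: float, clip_times: List[float],
                       saliency: List[float]) -> List[ClipLabel]:
    """Convert an interval annotation plus a worthiness curve into per-clip labels."""
    return [
        ClipLabel(
            foreground=start <= t <= end,
            start_offset=max(t - start, 0.0),
            end_offset=max(end - t, 0.0),
            saliency=s,
        )
        for t, s in zip(clip_times, saliency)
    ]
```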
1 code implementation • ICCV 2023 • Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, Pengchuan Zhang
Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks.
no code implementations • 14 Jun 2023 • Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, Mike Zheng Shou
2) Flexible inputs and intermediate results.
2 code implementations • ICCV 2023 • Alex Jinpeng Wang, Kevin Qinghong Lin, David Junhao Zhang, Stan Weixian Lei, Mike Zheng Shou
Specifically, TL;DR can compress mainstream VLP datasets at a high ratio, e.g., reducing the well-cleaned CC3M dataset from 2.82M to 0.67M (~24%) and the noisy YFCC15M from 15M to 2.5M (~16.7%).
1 code implementation • 23 May 2023 • Jinheng Xie, Kai Ye, Yudong Li, Yuexiang Li, Kevin Qinghong Lin, Yefeng Zheng, Linlin Shen, Mike Zheng Shou
Experimental results demonstrate that VisorGPT can effectively model the visual prior, which can be employed for many vision tasks, such as customizing accurate human pose for conditional image synthesis models like ControlNet.
1 code implementation • CVPR 2023 • Joya Chen, Difei Gao, Kevin Qinghong Lin, Mike Zheng Shou
Humans excel at learning from expert demonstrations and solving their own problems.
Ranked #1 on Video-to-image Affordance Grounding on EPIC-Hotspot
1 code implementation • 23 Sep 2022 • Rong-Cheng Tu, Xian-Ling Mao, Kevin Qinghong Lin, Chengfei Cai, Weize Qin, Hongfa Wang, Wei Wei, Heyan Huang
Recently, to improve unsupervised image retrieval performance, many unsupervised hashing methods have been proposed that construct a semantic similarity matrix from the similarities between image features extracted by a pre-trained CNN model.
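The semantic similarity matrix mentioned above is typically built from pairwise feature similarities; a minimal sketch is shown below, where L2-normalized CNN features give a cosine Gram matrix that is thresholded into a binary similar/dissimilar pseudo-label matrix. The function name and threshold value are illustrative assumptions.

```python
import numpy as np


def semantic_similarity_matrix(features: np.ndarray, threshold: float = 0.6) -> np.ndarray:
    """Binary pseudo-label similarity matrix from pre-trained CNN features.

    features: (N, D) array of image features.  Rows are L2-normalized so the
    Gram matrix holds cosine similarities; thresholding yields the kind of
    supervision unsupervised hashing methods train against.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    cosine = feats @ feats.T                      # (N, N) cosine similarities
    return (cosine >= threshold).astype(np.float32)
```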
1 code implementation • 4 Jul 2022 • Kevin Qinghong Lin, Alex Jinpeng Wang, Rui Yan, Eric Zhongcong Xu, RongCheng Tu, Yanru Zhu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Wei Liu, Mike Zheng Shou
In this report, we propose a video-language pretraining (VLP) based solution, built on EgoVLP, for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge.
1 code implementation • 4 Jul 2022 • Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, RongCheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou
In this report, we propose a video-language pretraining (VLP) based solution, built on EgoVLP, for four Ego4D challenge tasks, including Natural Language Query (NLQ), Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR).
2 code implementations • 3 Jun 2022 • Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, RongCheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou
Video-Language Pretraining (VLP), which aims to learn transferable representation to advance a wide range of video-text downstream tasks, has recently received increasing attention.
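The transferable representations referenced above are usually learned with a video-text contrastive objective; the sketch below shows the standard symmetric InfoNCE loss such pretraining typically builds on. This is a generic formulation, not the paper's exact egocentric-specific variant, and the function name and temperature are assumptions.

```python
import torch
import torch.nn.functional as F


def video_text_infonce(video_emb: torch.Tensor, text_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched video/text embeddings.

    Matched pairs lie on the diagonal of the similarity matrix and are pulled
    together; all other pairings in the batch act as negatives.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return (loss_v2t + loss_t2v) / 2
```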