Search Results for author: Kevin Qinghong Lin

Found 28 papers, 19 papers with code

Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

1 code implementation · 27 May 2025 · Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr

To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters; (ii) Textual Coherence: language fluency; (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge; and, notably, (iv) PaperQuiz: the poster's ability to convey core paper content, as measured by VLMs answering generated quizzes.
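
A minimal sketch of how a PaperQuiz-style score could be computed, assuming generated multiple-choice quizzes and a hypothetical `vlm_answer` call (stubbed here; not the paper's released code):

```python
# Hypothetical PaperQuiz-style scorer: a VLM sees only the poster image
# and answers quizzes generated from the source paper.
from dataclasses import dataclass

@dataclass
class QuizItem:
    question: str
    choices: list[str]
    answer_idx: int  # index of the correct choice

def vlm_answer(poster_image: bytes, item: QuizItem) -> int:
    """Placeholder for a real VLM call; returns a choice index."""
    return 0  # stub so the sketch runs end-to-end

def paper_quiz_score(poster_image: bytes, quizzes: list[QuizItem]) -> float:
    """Fraction of quiz questions answered correctly from the poster alone."""
    correct = sum(vlm_answer(poster_image, q) == q.answer_idx for q in quizzes)
    return correct / max(len(quizzes), 1)
```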

Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

1 code implementation · 22 May 2025 · Jiaqi Wang, Kevin Qinghong Lin, James Cheng, Mike Zheng Shou

This introduces a think-or-not format that serves as a cold start for selective reasoning; a subsequent GRPO stage then lets the model freely explore when to think or not, while maximizing task-aware outcome rewards.

Reinforcement Learning (RL)
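
For context, GRPO normalizes each sampled response's reward against its sampling group; a minimal sketch, with an illustrative think-or-not reward that is only an assumption about the paper's task-aware rewards:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages, as in GRPO: normalize each sampled
    response's reward by the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Illustrative (assumed) task-aware reward: correctness, plus a small
# bonus for skipping the thinking trace when the answer is still correct.
def reward(correct: bool, used_thinking: bool, skip_bonus: float = 0.1) -> float:
    return float(correct) + (skip_bonus if correct and not used_thinking else 0.0)
```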

VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

1 code implementation · 17 Mar 2025 · Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou

Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence.

Grounded Video Question Answering · Temporal Localization +2

VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary

1 code implementation · CVPR 2025 · Kevin Qinghong Lin, Mike Zheng Shou

Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary.

EgoSchema · Retrieval +1

ROICtrl: Boosting Instance Control for Visual Generation

no code implementations · CVPR 2025 · YuChao Gu, Yipin Zhou, Yunfan Ye, Yixin Nie, Licheng Yu, Pingchuan Ma, Kevin Qinghong Lin, Mike Zheng Shou

Natural language often struggles to accurately associate positional and attribute information with multiple instances, which limits current text-based visual generation models to simpler compositions featuring only a few dominant instances.

Attribute · Object Detection +1

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

1 code implementation · CVPR 2025 · Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

In this work, we develop a vision-language-action model for the digital world, named ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational cost by formulating screenshots as a UI connected graph and adaptively identifying redundant relationships among patches, which then serve as the criteria for token selection in self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation, or pairing multi-turn query-action sequences per screenshot to improve training efficiency; (iii) small-scale, high-quality GUI instruction-following datasets built through careful data curation and a resampling strategy that addresses significant data-type imbalances.

Instruction Following · Natural Language Visual Grounding +1
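
A toy illustration of the UI-guided selection idea: treat patches with identical pixel content as redundant (a crude stand-in for the paper's connected-graph criterion) and keep a single representative token per group:

```python
import numpy as np

def select_ui_tokens(patches: np.ndarray) -> list[int]:
    """patches: (N, D) flattened screenshot patches.
    Group patches with identical content (a simplification of the paper's
    redundancy criterion) and keep one representative index per group."""
    seen = {}
    keep = []
    for i, p in enumerate(patches):
        key = p.tobytes()
        if key not in seen:   # first patch with this content: keep it
            seen[key] = i
            keep.append(i)
    return keep

patches = np.zeros((16, 48), dtype=np.uint8)  # e.g., a uniform background
patches[3] = 255                              # one visually distinct patch
print(select_ui_tokens(patches))              # -> [0, 3]
```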

MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation

1 code implementation · CVPR 2025 · Weijia Wu, MingYu Liu, Zeyu Zhu, Xi Xia, Haoen Feng, Wen Wang, Kevin Qinghong Lin, Chunhua Shen, Mike Zheng Shou

Recent advancements in video generation models, like Stable Video Diffusion, show promising results, but primarily focus on short, single-scene videos.

Video Generation

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

no code implementations · 29 Aug 2024 · Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou

Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
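
A schematic PyTorch sketch of the mixture-of-depths idea for vision tokens: a per-layer router scores tokens, only the top fraction runs through the layer's computation, and the rest skip it via the residual path (routing losses and the paper's other details are omitted):

```python
import torch
import torch.nn as nn

class MoDVisionLayer(nn.Module):
    """Route only the top-k vision tokens through the expensive block;
    the rest pass through unchanged via the residual connection."""
    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.router = nn.Linear(dim, 1)   # scores each token
        # dim must be divisible by nhead
        self.block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, D)
        scores = self.router(x).squeeze(-1)               # (B, N)
        k = max(1, int(x.size(1) * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices               # tokens to process
        out = x.clone()
        for b in range(x.size(0)):                        # per-sample gather
            sel = x[b, idx[b]].unsqueeze(0)               # (1, k, D)
            out[b, idx[b]] = self.block(sel).squeeze(0)   # processed tokens
        return out                                        # skipped tokens unchanged
```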

Learning Video Context as Interleaved Multimodal Sequences

1 code implementation · 31 Jul 2024 · Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou

In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts.

Language Modeling · Language Modelling +7

GUI Action Narrator: Where and When Did That Action Take Place?

no code implementations · 19 Jun 2024 · Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou

The advent of Multimodal LLMs has significantly enhanced image OCR capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks.

Optical Character Recognition (OCR) · Video Captioning

VideoGUI: A Benchmark for GUI Automation from Instructional Videos

no code implementations · 14 Jun 2024 · Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks.

Video Editing

Learning Long-form Video Prior via Generative Pre-Training

1 code implementation · 24 Apr 2024 · Jinheng Xie, Jiajun Feng, Zhaoxu Tian, Kevin Qinghong Lin, Yawen Huang, Xi Xia, Nanxu Gong, Xu Zuo, Jiaqi Yang, Yefeng Zheng, Mike Zheng Shou

Instead of operating in pixel space, it is efficient to employ visual locations such as bounding boxes and keypoints to represent key information in videos; these can simply be discretized and then tokenized for consumption by GPT.

Form
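
A minimal sketch of the discretize-then-tokenize step, assuming box coordinates normalized to [0, 1] and an illustrative bin count and token format:

```python
def box_to_tokens(box, num_bins: int = 256) -> list[str]:
    """Quantize a normalized (x1, y1, x2, y2) box into discrete location
    tokens that a GPT-style model can consume as ordinary vocabulary."""
    tokens = []
    for coord in box:
        b = min(int(coord * num_bins), num_bins - 1)  # clamp to last bin
        tokens.append(f"<loc_{b}>")
    return tokens

print(box_to_tokens((0.1, 0.2, 0.5, 0.9)))
# -> ['<loc_25>', '<loc_51>', '<loc_128>', '<loc_230>']
```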

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

no code implementations · 1 Jan 2024 · Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, JianFeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

COSMO, our unified framework, merges unimodal and multimodal elements, enhancing model performance for tasks involving textual and visual data while notably reducing learnable parameters.

Language Modelling · Reading Comprehension +1

Bootstrapping SparseFormers from Vision Foundation Models

1 code implementation · CVPR 2024 · Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou

In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way.

UniVTG: Towards Unified Video-Language Temporal Grounding

1 code implementation · ICCV 2023 · Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou

Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their ability to generalize to diverse VTG tasks and labels.

Ranked #8 on Highlight Detection on QVHighlights (using extra training data)

Highlight Detection · Moment Retrieval +3
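
To see why a unified formulation is possible, a moment-retrieval interval can be cast as the dense per-clip curve used for highlight detection; a sketch assuming fixed-length clips (the paper's actual unified formulation is richer):

```python
def interval_to_clip_labels(start: float, end: float,
                            num_clips: int, clip_len: float) -> list[float]:
    """Convert a (start, end) moment interval in seconds into a per-clip
    'worthiness' curve: each clip's label is its fractional overlap."""
    labels = []
    for i in range(num_clips):
        c_start, c_end = i * clip_len, (i + 1) * clip_len
        overlap = max(0.0, min(end, c_end) - max(start, c_start))
        labels.append(overlap / clip_len)
    return labels

print(interval_to_clip_labels(2.0, 5.0, num_clips=5, clip_len=2.0))
# -> [0.0, 1.0, 0.5, 0.0, 0.0]
```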

Too Large; Data Reduction for Vision-Language Pre-Training

2 code implementations · ICCV 2023 · Alex Jinpeng Wang, Kevin Qinghong Lin, David Junhao Zhang, Stan Weixian Lei, Mike Zheng Shou

Specifically, TL;DR can compress mainstream VLP datasets at a high ratio, e.g., reducing the well-cleaned CC3M dataset from 2.82M to 0.67M (~24%) and the noisy YFCC15M from 15M to 2.5M (~16.7%).

Decoder

VisorGPT: Learning Visual Prior via Generative Pre-Training

1 code implementation · 23 May 2023 · Jinheng Xie, Kai Ye, Yudong Li, Yuexiang Li, Kevin Qinghong Lin, Yefeng Zheng, Linlin Shen, Mike Zheng Shou

Experimental results demonstrate that VisorGPT can effectively model the visual prior, which can be employed for many vision tasks, such as customizing accurate human pose for conditional image synthesis models like ControlNet.

Image Generation · Language Modeling +2

Unsupervised Hashing with Semantic Concept Mining

1 code implementation · 23 Sep 2022 · Rong-Cheng Tu, Xian-Ling Mao, Kevin Qinghong Lin, Chengfei Cai, Weize Qin, Hongfa Wang, Wei Wei, Heyan Huang

Recently, to improve unsupervised image retrieval performance, many unsupervised hashing methods have been proposed that design a semantic similarity matrix based on the similarities between image features extracted by a pre-trained CNN model.

Image Retrieval · Prompt Engineering +4
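
A minimal sketch of building such a similarity matrix from pre-trained CNN features with cosine similarity (the threshold and ±1 encoding are illustrative; the paper's concept-mining construction is more involved):

```python
import numpy as np

def similarity_matrix(features: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """features: (N, D) image features from a pre-trained CNN.
    Returns an (N, N) matrix with +1 for pairs deemed semantically
    similar and -1 for dissimilar ones, based on cosine similarity."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    cos = f @ f.T
    return np.where(cos >= threshold, 1.0, -1.0)

feats = np.random.rand(8, 512)
S = similarity_matrix(feats)
print(S.shape, S.diagonal().min())  # (8, 8) 1.0
```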

Egocentric Video-Language Pretraining @ Ego4D Challenge 2022

1 code implementation · 4 Jul 2022 · Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, RongCheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou

In this report, we propose a video-language pretraining (VLP) based solution, EgoVLP, for four Ego4D challenge tasks, including Natural Language Query (NLQ), Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR).

Language Modeling · Language Modelling +1
