Search Results for author: Wenqi Shao

Found 47 papers, 33 papers with code

PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models

1 code implementation · 17 Jun 2024 · Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, Ping Luo

Our findings reveal that: (1) even advanced models frequently err in various physical scenarios, except for optics; (2) GPT-4o, with item-specific scoring instructions, effectively evaluates the models' understanding of physical commonsense, closely aligning with human assessments; and (3) current T2I models are primarily focused on text-to-image translation, lacking profound reasoning regarding physical commonsense.

Image Generation

Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

1 code implementation · 13 Jun 2024 · Tianle Zhang, Langtian Ma, Yuchen Yan, Yuchen Zhang, Kai Wang, Yue Yang, Ziyao Guo, Wenqi Shao, Yang You, Yu Qiao, Ping Luo, Kaipeng Zhang

To address these challenges, this paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models.

GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

1 code implementation · 12 Jun 2024 · Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, Ping Luo

Smartphone users often navigate across multiple applications (apps) to complete tasks such as sharing content between social media platforms.

Navigate

Needle In A Multimodal Haystack

1 code implementation · 11 Jun 2024 · Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang

In this work, we present Needle In A Multimodal Haystack (MM-NIAH), the first benchmark specifically designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.

Retrieval

UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge

no code implementations · 23 May 2024 · Chuanhao Li, Zhen Li, Chenchen Jing, Shuo Liu, Wenqi Shao, Yuwei Wu, Ping Luo, Yu Qiao, Kaipeng Zhang

In this paper, we propose a plug-and-play framework, dubbed UDKAG, for augmenting existing LVLMs in handling visual question answering (VQA) about up-to-date knowledge.

Question Answering Visual Question Answering

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

2 code implementations · 9 May 2024 · Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li

Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details.

Adapting LLaMA Decoder to Vision Transformer

1 code implementation · 10 Apr 2024 · Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong liu, Taiqiang Wu, Kaipeng Zhang, Songyang Zhang, Kai Chen, Ping Luo

We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention collapse issue, resulting in the failure of network training.
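The attention-collapse issue can be seen in a toy example: with a causal mask, the first token has only itself as a visible key, so its softmax weights degenerate to a one-hot vector. A minimal pure-Python sketch (illustrative only, not the paper's code):

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax attention weights; optionally apply a causal mask
    that forbids each query from attending to later (future) positions."""
    n = len(scores)
    out = []
    for i in range(n):          # query index
        row = []
        for j in range(n):      # key index
            if causal and j > i:
                row.append(float("-inf"))  # masked-out future token
            else:
                row.append(scores[i][j])
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

w = attention_weights([[0.0, 0.0], [0.0, 0.0]], causal=True)
# w[0] == [1.0, 0.0] -- the first query's attention collapses onto itself
```

With `causal=False`, the same uniform scores give every token equal weight, which is why removing or repositioning the mask matters for ViT training.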

Computational Efficiency Decoder +2

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

1 code implementation · CVPR 2024 · Lirui Zhao, Yue Yang, Kaipeng Zhang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Rongrong Ji

Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research.

Diversity Language Modelling +1

AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions

no code implementations · 14 Mar 2024 · Hao Zhang, Wenqi Shao, Hong Liu, Yongqiang Ma, Ping Luo, Yu Qiao, Kaipeng Zhang

To bridge this gap, we introduce AVIBench, a framework designed to analyze the robustness of LVLMs when facing various adversarial visual-instructions (AVIs), including four types of image-based AVIs, ten types of text-based AVIs, and nine types of content bias AVIs (such as gender, violence, cultural, and racial biases, among others).

Fairness Language Modelling

Position: Towards Implicit Prompt For Text-To-Image Models

no code implementations · 4 Mar 2024 · Yue Yang, Yuqi Lin, Hong Liu, Wenqi Shao, Runjian Chen, Hailong Shang, Yu Wang, Yu Qiao, Kaipeng Zhang, Ping Luo

We call for increased attention to the potential and risks of implicit prompts in the T2I community and further investigation into the capabilities and impacts of implicit prompts, advocating for a balanced approach that harnesses their benefits while mitigating their risks.

Position

RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation

no code implementations · 22 Feb 2024 · Junting Chen, Yao Mu, Qiaojun Yu, Tianming Wei, Silang Wu, Zhecheng Yuan, Zhixuan Liang, Chao Yang, Kaipeng Zhang, Wenqi Shao, Yu Qiao, Huazhe Xu, Mingyu Ding, Ping Luo

To bridge this "ideal-to-real" gap, this paper presents RoboScript, a platform for 1) a deployable robot manipulation pipeline powered by code generation; and 2) a code generation benchmark for robot manipulation tasks in free-form natural language.

Code Generation Common Sense Reasoning +2

BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

2 code implementations · 18 Feb 2024 · Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo

Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization and text question-answering.

Question Answering Text Summarization

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

1 code implementation · CVPR 2024 · Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo

Importantly, all images in this benchmark are sourced from authentic medical scenarios, ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs.

Medical Visual Question Answering Question Answering +1

Two Trades is not Baffled: Condensing Graph via Crafting Rational Gradient Matching

1 code implementation · 7 Feb 2024 · Tianle Zhang, Yuchen Zhang, Kun Wang, Kai Wang, Beining Yang, Kaipeng Zhang, Wenqi Shao, Ping Liu, Joey Tianyi Zhou, Yang You

Training on large-scale graphs has achieved remarkable results in graph representation learning, but its cost and storage have raised growing concerns.

Graph Representation Learning

Cached Transformers: Improving Transformers with Differentiable Memory Cache

1 code implementation · 20 Dec 2023 · Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo

This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens.
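The gist of a token memory cache can be sketched as attending over both the current tokens and a set of cached tokens, with a gate mixing the two streams. This is a hypothetical simplification of GRC attention: the gate, shapes, and cache-update rule below are placeholders, not the paper's design.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for a single 1-D query (pure-Python sketch)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

def cached_attention(query, tokens, cache, gate):
    """Mix self-attention over current tokens with attention over a memory
    cache of past tokens; `gate` in [0, 1] balances the two streams."""
    self_out = attention(query, tokens, tokens)
    cache_out = attention(query, cache, cache)
    return [gate * c + (1.0 - gate) * s for s, c in zip(self_out, cache_out)]
```

With `gate=0` the layer reduces to plain self-attention; a learned gate lets the model decide per-layer how much long-range cached context to inject.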

Image Classification Instance Segmentation +6

MLLMs-Augmented Visual-Language Representation Learning

1 code implementation · 30 Nov 2023 · Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, Yang You

Visual-language pre-training has achieved remarkable success in many multi-modal tasks, largely attributed to the availability of large-scale image-text datasets.

Image-text Retrieval Representation Learning +1

DiffusionMat: Alpha Matting as Sequential Refinement Learning

no code implementations · 22 Nov 2023 · Yangyang Xu, Shengfeng He, Wenqi Shao, Kwan-Yee K. Wong, Yu Qiao, Ping Luo

In this paper, we introduce DiffusionMat, a novel image matting framework that employs a diffusion model for the transition from coarse to refined alpha mattes.

Denoising Image Matting

Tree-Planner: Efficient Close-loop Task Planning with Large Language Models

no code implementations · 12 Oct 2023 · Mengkang Hu, Yao Mu, Xinmiao Yu, Mingyu Ding, Shiguang Wu, Wenqi Shao, Qiguang Chen, Bin Wang, Yu Qiao, Ping Luo

This paper studies close-loop task planning, which refers to the process of generating a sequence of skills (a plan) to accomplish a specific goal while adapting the plan based on real-time observations.

Decision Making

Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching

no code implementations · 8 Oct 2023 · Hao Zhang, Lumin Xu, Shenqi Lai, Wenqi Shao, Nanning Zheng, Ping Luo, Yu Qiao, Kaipeng Zhang

Current image-based keypoint detection methods for animal (including human) bodies and faces are generally divided into fully-supervised and few-shot class-agnostic approaches.

Keypoint Detection Open Vocabulary Keypoint Detection

SPOT: Scalable 3D Pre-training via Occupancy Prediction for Autonomous Driving

1 code implementation · 19 Sep 2023 · Xiangchao Yan, Runjian Chen, Bo Zhang, Jiakang Yuan, Xinyu Cai, Botian Shi, Wenqi Shao, Junchi Yan, Ping Luo, Yu Qiao

Our contributions are threefold: (1) Occupancy prediction is shown to be promising for learning general representations, which is demonstrated by extensive experiments on plenty of datasets and tasks.

3D Object Detection Autonomous Driving +3

Foundation Model is Efficient Multimodal Multitask Model Selector

1 code implementation · NeurIPS 2023 · Fanqing Meng, Wenqi Shao, Zhanglin Peng, Chonghe Jiang, Kaipeng Zhang, Yu Qiao, Ping Luo

This paper investigates an under-explored but important problem: given a collection of pre-trained neural networks, predicting their performance on each multi-modal task without fine-tuning them, such as image recognition, referring, captioning, visual question answering, and text question answering.

Model Selection Question Answering +1

Tiny LVLM-eHub: Early Multimodal Experiments with Bard

1 code implementation · 7 Aug 2023 · Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, Ping Luo

Secondly, it conducts an in-depth analysis of LVLMs' predictions using the ChatGPT Ensemble Evaluation (CEE), which leads to a robust and accurate evaluation and exhibits improved alignment with human evaluation compared to the word matching approach.

Hallucination Visual Reasoning

Align, Adapt and Inject: Sound-guided Unified Image Generation

no code implementations · 20 Jun 2023 · Yue Yang, Kaipeng Zhang, Yuying Ge, Wenqi Shao, Zeyue Xue, Yu Qiao, Ping Luo

Then, we propose the audio adapter to adapt audio representation into an audio token enriched with specific semantics, which can be injected into a frozen T2I model flexibly.

Image Generation Text Retrieval

Real-time Controllable Denoising for Image and Video

1 code implementation · CVPR 2023 · Zhaoyang Zhang, Yitong Jiang, Wenqi Shao, Xiaogang Wang, Ping Luo, Kaimo Lin, Jinwei Gu

Controllable image denoising aims to generate clean samples with human perceptual priors and balance sharpness and smoothness.

Image Denoising Video Denoising

Not All Models Are Equal: Predicting Model Transferability in a Self-challenging Fisher Space

1 code implementation · 7 Jul 2022 · Wenqi Shao, Xun Zhao, Yixiao Ge, Zhaoyang Zhang, Lei Yang, Xiaogang Wang, Ying Shan, Ping Luo

It is challenging because the ground-truth model ranking for each task can only be generated by fine-tuning the pre-trained models on the target dataset, which is brute-force and computationally expensive.

Transferability

CO^3: Cooperative Unsupervised 3D Representation Learning for Autonomous Driving

1 code implementation · 8 Jun 2022 · Runjian Chen, Yao Mu, Runsen Xu, Wenqi Shao, Chenhan Jiang, Hang Xu, Zhenguo Li, Ping Luo

In this paper, we propose CO^3, namely Cooperative Contrastive Learning and Contextual Shape Prediction, to learn 3D representation for outdoor-scene point clouds in an unsupervised manner.
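Contrastive objectives of this kind are usually built on an InfoNCE-style loss that pulls an anchor feature toward its positive and pushes it away from negatives. A generic sketch of standard InfoNCE (not CO^3's cooperative formulation; vectors are assumed already normalized):

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE loss: negative log-probability of the positive under a
    softmax over similarities to the positive and all negative samples."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    logits = [dot(anchor, positive) / temperature] + \
             [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]   # small when anchor ~ positive, large otherwise
```

A matching anchor/positive pair yields a near-zero loss, while a mismatched pair with a similar negative yields a large one, which is the gradient signal that shapes the representation.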

Autonomous Driving Contrastive Learning +1

BWCP: Probabilistic Learning-to-Prune Channels for ConvNets via Batch Whitening

no code implementations · 13 May 2021 · Wenqi Shao, Hang Yu, Zhaoyang Zhang, Hang Xu, Zhenguo Li, Ping Luo

To address this problem, we develop a probability-based pruning algorithm, called batch whitening channel pruning (BWCP), which can stochastically discard unimportant channels by modeling the probability of a channel being activated.
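The stochastic-discard idea can be sketched as sampling a per-channel keep/drop decision from each channel's modeled activation probability. This is a hypothetical simplification; BWCP's actual probabilities come from batch-whitening statistics, not the placeholder inputs used here.

```python
import random

def stochastic_channel_mask(p_activated, rng=random.random):
    """Keep each channel with its modeled activation probability: channels whose
    probability is near zero are (almost) always discarded, near one almost
    always kept. `rng` is injectable so the sampling can be made deterministic."""
    return [1 if rng() < p else 0 for p in p_activated]

mask = stochastic_channel_mask([1.0, 0.0], rng=lambda: 0.5)
# mask == [1, 0] -- the dead channel is discarded, the active one kept
```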

What Makes for End-to-End Object Detection?

1 code implementation · 10 Dec 2020 · Peize Sun, Yi Jiang, Enze Xie, Wenqi Shao, Zehuan Yuan, Changhu Wang, Ping Luo

We identify that classification cost in matching cost is the main ingredient: (1) previous detectors only consider location cost, (2) by additionally introducing classification cost, previous detectors immediately produce one-to-one prediction during inference.
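The role of classification cost in the matching can be illustrated with a tiny brute-force one-to-one matcher. Real detectors use Hungarian matching and richer cost terms; the L1 box distance and negative class score below are simplifications for illustration.

```python
from itertools import permutations

def match(pred_boxes, pred_cls_scores, gt_boxes, gt_labels, cls_weight=1.0):
    """One-to-one matching minimizing location cost (L1 box distance) plus a
    classification cost (negative score of the ground-truth class), by brute
    force over assignments. Returns best[g] = prediction index for GT g."""
    n = len(gt_boxes)
    def cost(p, g):
        loc = sum(abs(a - b) for a, b in zip(pred_boxes[p], gt_boxes[g]))
        cls = -pred_cls_scores[p][gt_labels[g]]
        return loc + cls_weight * cls
    best = min(permutations(range(len(pred_boxes)), n),
               key=lambda perm: sum(cost(perm[g], g) for g in range(n)))
    return list(best)

# Two predictions with identical boxes: only the classification cost
# disambiguates which one is matched to the ground truth.
assign = match([[0, 0, 1, 1], [0, 0, 1, 1]],
               [[0.2, 0.8], [0.9, 0.1]],
               [[0, 0, 1, 1]], [0])
# assign == [1] -- the prediction confident in class 0 wins the match
```

This is exactly the paper's point: with location cost alone the two predictions tie, so duplicates survive; adding classification cost makes the assignment, and hence the prediction, one-to-one.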

General Classification Object +2

Convolution-Weight-Distribution Assumption: Rethinking the Criteria of Channel Pruning

no code implementations · 24 Apr 2020 · Zhongzhan Huang, Wenqi Shao, Xinjiang Wang, Liang Lin, Ping Luo

Channel pruning is a popular technique for compressing convolutional neural networks (CNNs), where various pruning criteria have been proposed to remove the redundant filters.

Channel Equilibrium Networks for Learning Deep Representation

1 code implementation · ICML 2020 · Wenqi Shao, Shitao Tang, Xingang Pan, Ping Tan, Xiaogang Wang, Ping Luo

Unlike prior arts that simply removed the inhibited channels, we propose to "wake them up" during training by designing a novel neural building block, termed Channel Equilibrium (CE) block, which enables channels at the same layer to contribute equally to the learned representation.

Channel Equilibrium Networks

no code implementations · 25 Sep 2019 · Wenqi Shao, Shitao Tang, Xingang Pan, Ping Tan, Xiaogang Wang, Ping Luo

However, over-sparse CNNs have many collapsed channels (i.e., many channels with undesired zero values), impeding their learning ability.

Differentiable Learning-to-Group Channels via Groupable Convolutional Neural Networks

no code implementations · ICCV 2019 · Zhaoyang Zhang, Jingyu Li, Wenqi Shao, Zhanglin Peng, Ruimao Zhang, Xiaogang Wang, Ping Luo

Existing models such as ResNeXt still suffer from sub-optimal performance due to manually defining the number of groups as a constant over all of the layers.

SSN: Learning Sparse Switchable Normalization via SparsestMax

1 code implementation · CVPR 2019 · Wenqi Shao, Tianjian Meng, Jingyu Li, Ruimao Zhang, Yudian Li, Xiaogang Wang, Ping Luo

Unlike $\ell_1$ and $\ell_0$ constraints that impose difficulties in optimization, we turn this constrained optimization problem into feed-forward computation by proposing SparsestMax, which is a sparse version of softmax.
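For intuition, a well-known sparse alternative to softmax is sparsemax, which projects the logits onto the probability simplex and produces exact zeros. The paper's SparsestMax is a different, sparser construction, so this is only a related baseline sketch:

```python
def sparsemax(z):
    """Sparsemax (Martins & Astudillo, 2016): Euclidean projection of z onto
    the probability simplex. Unlike softmax it returns exact zeros."""
    srt = sorted(z, reverse=True)
    cumsum, tau = 0.0, 0.0
    for i, v in enumerate(srt, start=1):
        cumsum += v
        if 1.0 + i * v > cumsum:       # support condition: z_(i) is still active
            tau = (cumsum - 1.0) / i   # threshold implied by the active set
    return [max(v - tau, 0.0) for v in z]

print(sparsemax([2.0, 0.0]))  # [1.0, 0.0] -- an exact zero, unlike softmax
```

Because the projection is a feed-forward computation, sparsity is reached without $\ell_1$ or $\ell_0$ penalties, which is the same optimization convenience the paper seeks with SparsestMax.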

Learning Efficient Detector with Semi-supervised Adaptive Distillation

1 code implementation · 2 Jan 2019 · Shitao Tang, Litong Feng, Wenqi Shao, Zhanghui Kuang, Wei zhang, Yimin Chen

ADL enlarges the distillation loss for hard-to-learn and hard-to-mimic samples and reduces it for the dominant easy samples, enabling distillation to work on a single-stage detector for the first time, even if the student and the teacher are identical.
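The weighting idea can be sketched as scaling a per-sample KL distillation term by a factor that grows with the sample's difficulty. The exponential weighting form and the `beta`/`gamma` parameters below are a hypothetical simplification, not the paper's exact ADL formula.

```python
import math

def adaptive_distillation_loss(student_p, teacher_p, beta=1.5, gamma=1.0, eps=1e-12):
    """Per-sample distillation: KL(teacher || student) scaled up for hard-to-mimic
    samples (large KL) and down toward zero for easy ones (small KL)."""
    kl = sum(t * math.log((t + eps) / (s + eps))
             for t, s in zip(teacher_p, student_p))
    weight = (1.0 - math.exp(-beta * kl)) ** gamma  # ~0 for easy, ->1 for hard
    return weight * kl
```

An easy sample the student already mimics contributes essentially nothing, while a hard sample keeps nearly its full KL term, so the gradient budget concentrates on the hard examples.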

Image Classification Knowledge Distillation +1

Towards Understanding Regularization in Batch Normalization

1 code implementation · ICLR 2019 · Ping Luo, Xinjiang Wang, Wenqi Shao, Zhanglin Peng

Batch Normalization (BN) improves both convergence and generalization in training neural networks.
