An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

1 code implementation • 11 Mar 2024 • Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang

To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones.

Computational Efficiency, Video Understanding
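
The pruning step mentioned in the FastV summary above can be illustrated with a minimal sketch. It rests on assumptions that the listing does not confirm: that each token has a single score (here, the average attention it receives at the chosen layer), that visual tokens are ranked by that score, and that half of them are kept, as the paper title suggests. The function name `prune_visual_tokens` and its parameters are hypothetical and are not taken from the authors' code.

```python
from typing import List, Sequence


def prune_visual_tokens(
    attention_received: Sequence[float],
    visual_positions: Sequence[int],
    keep_ratio: float = 0.5,
) -> List[int]:
    """Return the sorted positions of the tokens that survive pruning.

    attention_received[i] is an assumed per-token score, e.g. the average
    attention that token i receives at the layer where pruning happens.
    Text (non-visual) tokens are always kept.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    visual = sorted(set(visual_positions))
    # Keep floor(len * ratio) visual tokens, but at least one if any exist.
    n_keep = max(1, int(len(visual) * keep_ratio)) if visual else 0

    # Rank visual tokens from highest to lowest score (ties keep position order).
    ranked = sorted(visual, key=lambda i: attention_received[i], reverse=True)
    kept_visual = set(ranked[:n_keep])

    visual_set = set(visual)
    return [
        i
        for i in range(len(attention_received))
        if i not in visual_set or i in kept_visual
    ]


# Toy sequence: positions 1-4 are visual tokens, 0 and 5 are text tokens.
scores = [0.30, 0.05, 0.20, 0.01, 0.10, 0.34]
kept = prune_visual_tokens(scores, visual_positions=[1, 2, 3, 4])
print(kept)  # [0, 2, 4, 5]
```

In a real model the kept positions would be used to slice the hidden states passed to later layers; that part is omitted here because it depends on the model implementation.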

GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields

no code implementations • 1 Jan 2024 • Xiao Pan, Zongxin Yang, Shuai Bai, Yi Yang

Targeting these issues, we propose GD^2-NeRF, a Generative Detail compensation framework via GAN and Diffusion that requires no inference-time finetuning and produces vivid, plausible details.

Image to 3D, Novel View Synthesis, +1

TouchStone: Evaluating Vision-Language Models by Language Models

1 code implementation • 31 Aug 2023 • Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang, Junyang Lin, Xinggang Wang, Chang Zhou, Jingren Zhou

Large vision-language models (LVLMs) have recently advanced rapidly, exhibiting a remarkable capacity for perceiving, understanding, and processing visual information by connecting a visual receptor with large language models (LLMs).

Visual Storytelling

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

2 code implementations • 18 May 2023 • Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, Chang Zhou

In this work, we explore a scalable way for building a general representation model toward unlimited modalities.

Ranked #1 on Semantic Segmentation on ADE20K (using extra training data)

Action Classification, AudioCaps, +16

OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models

1 code implementation • 8 Dec 2022 • Jinze Bai, Rui Men, Hao Yang, Xuancheng Ren, Kai Dang, Yichang Zhang, Xiaohuan Zhou, Peng Wang, Sinan Tan, An Yang, Zeyu Cui, Yu Han, Shuai Bai, Wenbin Ge, Jianxin Ma, Junyang Lin, Jingren Zhou, Chang Zhou

As a starting point, we provide presets of 7 different modalities and 23 highly diverse example tasks in OFASys, with which we also develop a first-of-its-kind single model, OFA+, that can handle text, image, speech, video, and motion data.

Multi-Task Learning

Pretrained Diffusion Models for Unified Human Motion Synthesis

no code implementations • 6 Dec 2022 • Jianxin Ma, Shuai Bai, Chang Zhou

Generative modeling of human motion has broad applications in computer animation, virtual reality, and robotics.

Motion Synthesis, Open-Ended Question Answering

Single Stage Virtual Try-on via Deformable Attention Flows

1 code implementation • 19 Jul 2022 • Shuai Bai, Huiling Zhou, Zhikang Li, Chang Zhou, Hongxia Yang

Virtual try-on aims to generate a photo-realistic fitting result given an in-shop garment and a reference person image.

Image Animation, Virtual Try-on

Connecting Language and Vision for Natural Language-Based Vehicle Retrieval

1 code implementation • 31 May 2021 • Shuai Bai, Zhedong Zheng, Xiaohan Wang, Junyang Lin, Zhu Zhang, Chang Zhou, Yi Yang, Hongxia Yang

In this paper, we apply one new modality, i.e., the language description, to search for the vehicle of interest and explore the potential of this task in real-world scenarios.

Language Modelling, Management, +2

Dense Relation Distillation with Context-aware Aggregation for Few-Shot Object Detection

1 code implementation • CVPR 2021 • Hanzhe Hu, Shuai Bai, Aoxue Li, Jinshi Cui, LiWei Wang

In this work, aiming to fully exploit the features of annotated novel objects and capture fine-grained features of query objects, we propose Dense Relation Distillation with Context-aware Aggregation (DCNet) to tackle the few-shot detection problem.

Few-Shot Object Detection, Meta-Learning, +3

Class-wise Dynamic Graph Convolution for Semantic Segmentation

no code implementations • ECCV 2020 • Hanzhe Hu, Deyi Ji, Weihao Gan, Shuai Bai, Wei Wu, Junjie Yan

Specifically, the CDGC module takes the coarse segmentation result as a class mask to extract node features for graph construction, then performs dynamic graph convolutions on the constructed graph to learn the feature aggregation and weight allocation.

graph construction, Segmentation, +1

Adaptive Dilated Network With Self-Correction Supervision for Counting

no code implementations • CVPR 2020 • Shuai Bai, Zhiqun He, Yu Qiao, Hanzhe Hu, Wei Wu, Junjie Yan

In this paper, we propose an adaptive dilated convolution and a novel supervised learning framework named self-correction (SC) supervision.

Multi-hierarchical Independent Correlation Filters for Visual Tracking

1 code implementation • 26 Nov 2018 • Shuai Bai, Zhiqun He, Ting-Bing Xu, Zheng Zhu, Yuan Dong, Hongliang Bai

For visual tracking, most traditional correlation filter (CF) based methods suffer from feature redundancy and a lack of motion information.

Motion Estimation, Visual Object Tracking, +1

