Search Results for author: Shoufa Chen

Found 15 papers, 9 papers with code

RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

no code implementations • 25 Feb 2024 • Yao Mu, Junting Chen, Qinglong Zhang, Shoufa Chen, Qiaojun Yu, Chongjian Ge, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, Peize Sun, Haibao Yu, Chao Yang, Wenqi Shao, Wenhai Wang, Jifeng Dai, Yu Qiao, Mingyu Ding, Ping Luo

Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI.

Ranked #72 on Visual Question Answering on MM-Vet

Code Generation, Multimodal Reasoning, +1 more

GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation

no code implementations • 7 Dec 2023 • Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, Juan-Manuel Perez-Rua

In this study, we explore Transformer-based diffusion models for image and video generation.

Text-to-Video Generation, Video Generation

FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing

no code implementations • 9 Oct 2023 • Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, Sen He

In this paper, for the first time, we introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing.

Optical Flow Estimation, Text-to-Video Editing, +1 more

Enhancing Your Trained DETRs with Box Refinement

1 code implementation • 21 Jul 2023 • Yiqun Chen, Qiang Chen, Peize Sun, Shoufa Chen, Jingdong Wang, Jian Cheng

We hope our work will bring the attention of the detection community to the localization bottleneck of current DETR-like models and highlight the potential of the RefineBox framework.

GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

2 code implementations • 7 Jul 2023 • Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, Ping Luo

Before sending to LLM, the reference is replaced by RoI features and interleaved with language embeddings as a sequence.

Ranked #1 on Visual Question Answering (VQA) on VCR (Q-AR) test

Attribute, Common Sense Reasoning, +4 more

453 stars

Going Denser with Open-Vocabulary Part Segmentation

2 code implementations • ICCV 2023 • Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, Zhicheng Yan

In this paper, we propose a detector with the ability to predict both open-vocabulary objects and their part segmentation.

Object, object-detection, +3 more

361 stars

InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

2 code implementations • 9 May 2023 • Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, Limin Wang, Ping Luo, Jifeng Dai, Yu Qiao

Different from existing interactive systems that rely on pure language, by incorporating pointing instructions, the proposed iGPT significantly improves the efficiency of communication between users and chatbots, as well as the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than 2.

Language Modelling

3,121 stars

Soft Neighbors are Positive Supporters in Contrastive Visual Representation Learning

no code implementations • 30 Mar 2023 • Chongjian Ge, Jiangliu Wang, Zhan Tong, Shoufa Chen, Yibing Song, Ping Luo

We evaluate our soft neighbor contrastive learning method (SNCLR) on standard visual recognition benchmarks, including image classification, object detection, and instance segmentation.

Contrastive Learning, Image Classification, +6 more

DiffusionDet: Diffusion Model for Object Detection

3 code implementations • ICCV 2023 • Shoufa Chen, Peize Sun, Yibing Song, Ping Luo

We propose DiffusionDet, a new framework that formulates object detection as a denoising diffusion process from noisy boxes to object boxes.

Denoising, Object, +2 more

1,988 stars

CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer

1 code implementation • 17 Jun 2022 • Yao Mu, Shoufa Chen, Mingyu Ding, Jianyu Chen, Runjian Chen, Ping Luo

In visual control, learning transferable state representation that can transfer between different control tasks is important to reduce the training sample size.

Transfer Learning

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

2 code implementations • 26 May 2022 • Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, Ping Luo

To address this challenge, we propose an effective adaptation approach for Transformer, namely AdaptFormer, which can adapt the pre-trained ViTs into many different image and video tasks efficiently.

Action Recognition, Video Recognition

290 stars

Towards High-Quality Temporal Action Detection with Sparse Proposals

1 code implementation • 18 Sep 2021 • Jiannan Wu, Peize Sun, Shoufa Chen, Jiewen Yang, Zihao Qi, Lan Ma, Ping Luo

Towards high-quality temporal action detection, we introduce Sparse Proposals to interact with the hierarchical features.

Action Detection, Avg, +2 more

CycleMLP: A MLP-like Architecture for Dense Prediction

8 code implementations • ICLR 2022 • Shoufa Chen, Enze Xie, Chongjian Ge, Runjian Chen, Ding Liang, Ping Luo

We build a family of models which surpass existing MLPs and even state-of-the-art Transformer-based models, e.g., Swin Transformer, while using fewer parameters and FLOPs.

Ranked #15 on Semantic Segmentation on DensePASS