no code implementations • 25 Feb 2024 • Yao Mu, Junting Chen, Qinglong Zhang, Shoufa Chen, Qiaojun Yu, Chongjian Ge, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, Peize Sun, Haibao Yu, Chao Yang, Wenqi Shao, Wenhai Wang, Jifeng Dai, Yu Qiao, Mingyu Ding, Ping Luo
Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI.
Ranked #72 on Visual Question Answering on MM-Vet
no code implementations • 7 Dec 2023 • Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, Juan-Manuel Perez-Rua
In this study, we explore Transformer-based diffusion models for image and video generation.
no code implementations • 9 Oct 2023 • Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, Sen He
In this paper, for the first time, we introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing.
1 code implementation • 21 Jul 2023 • Yiqun Chen, Qiang Chen, Peize Sun, Shoufa Chen, Jingdong Wang, Jian Cheng
We hope our work will bring the attention of the detection community to the localization bottleneck of current DETR-like models and highlight the potential of the RefineBox framework.
2 code implementations • 7 Jul 2023 • Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, Ping Luo
Before sending to LLM, the reference is replaced by RoI features and interleaved with language embeddings as a sequence.
Ranked #1 on Visual Question Answering (VQA) on VCR (Q-AR) test
2 code implementations • ICCV 2023 • Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, Zhicheng Yan
In this paper, we propose a detector with the ability to predict both open-vocabulary objects and their part segmentation.
2 code implementations • 9 May 2023 • Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, LiMin Wang, Ping Luo, Jifeng Dai, Yu Qiao
Different from existing interactive systems that rely on pure language, by incorporating pointing instructions, the proposed iGPT significantly improves the efficiency of communication between users and chatbots, as well as the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than 2.
no code implementations • 30 Mar 2023 • Chongjian Ge, Jiangliu Wang, Zhan Tong, Shoufa Chen, Yibing Song, Ping Luo
We evaluate our soft neighbor contrastive learning method (SNCLR) on standard visual recognition benchmarks, including image classification, object detection, and instance segmentation.
3 code implementations • ICCV 2023 • Shoufa Chen, Peize Sun, Yibing Song, Ping Luo
We propose DiffusionDet, a new framework that formulates object detection as a denoising diffusion process from noisy boxes to object boxes.
1 code implementation • 17 Jun 2022 • Yao Mu, Shoufa Chen, Mingyu Ding, Jianyu Chen, Runjian Chen, Ping Luo
In visual control, learning transferable state representation that can transfer between different control tasks is important to reduce the training sample size.
2 code implementations • 26 May 2022 • Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, Ping Luo
To address this challenge, we propose an effective adaptation approach for Transformer, namely AdaptFormer, which can adapt the pre-trained ViTs into many different image and video tasks efficiently.
1 code implementation • 18 Sep 2021 • Jiannan Wu, Peize Sun, Shoufa Chen, Jiewen Yang, Zihao Qi, Lan Ma, Ping Luo
Towards high-quality temporal action detection, we introduce Sparse Proposals to interact with the hierarchical features.
8 code implementations • ICLR 2022 • Shoufa Chen, Enze Xie, Chongjian Ge, Runjian Chen, Ding Liang, Ping Luo
We build a family of models which surpass existing MLPs and even state-of-the-art Transformer-based models, e. g., Swin Transformer, while using fewer parameters and FLOPs.
Ranked #15 on Semantic Segmentation on DensePASS
no code implementations • ICCV 2021 • Shoufa Chen, Peize Sun, Enze Xie, Chongjian Ge, Jiannan Wu, Lan Ma, Jiajun Shen, Ping Luo
WOO takes a unified video backbone to simultaneously extract features for actor location and action classification.
no code implementations • 10 Dec 2019 • Shoufa Chen, Yunpeng Chen, Shuicheng Yan, Jiashi Feng
We demonstrate the effectiveness of our search strategy by conducting extensive experiments.