1 code implementation • 11 Aug 2024 • Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li
In response to this observation, we present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately.
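The mediator-token idea can be pictured with a minimal attention sketch (an assumption about the general mechanism, not the paper's implementation): queries never attend to keys directly; a small set of learnable mediator tokens first summarizes the keys/values, and the queries then read from the mediators, replacing one quadratic interaction with two linear ones. Names such as `MediatorAttention` and `num_mediators` are hypothetical.

```python
import torch
import torch.nn as nn

class MediatorAttention(nn.Module):
    """Sketch of attention mediated by a small set of learnable tokens.

    Stage 1: mediator tokens gather context from the key/value sequence.
    Stage 2: queries attend only to the mediators.
    Complexity drops from O(N^2) to O(N * M) with M << N.
    """
    def __init__(self, dim: int, num_heads: int = 8, num_mediators: int = 16):
        super().__init__()
        self.mediators = nn.Parameter(torch.randn(1, num_mediators, dim) * 0.02)
        self.summarize = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.disperse = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, q: torch.Tensor, kv: torch.Tensor) -> torch.Tensor:
        # q: (B, Nq, C), kv: (B, Nk, C)
        m = self.mediators.expand(q.size(0), -1, -1)
        m, _ = self.summarize(m, kv, kv)   # mediators aggregate the keys/values
        out, _ = self.disperse(q, m, m)    # queries read from the mediators
        return out

if __name__ == "__main__":
    attn = MediatorAttention(dim=64)
    x = torch.randn(2, 256, 64)            # e.g. flattened latent tokens
    print(attn(x, x).shape)                # torch.Size([2, 256, 64])
```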
no code implementations • 14 Jun 2024 • Zeyu Liu, Weicong Liang, Yiming Zhao, Bohan Chen, Lin Liang, Lijuan Wang, Ji Li, Yuhui Yuan
With the combination of these techniques, we deliver a powerful customized multilingual text encoder, Glyph-ByT5-v2, and a strong aesthetic graphic generation model, Glyph-SDXL-v2, that can support accurate spelling in 10 different languages.
no code implementations • 12 Jun 2024 • Xinzhi Mu, Li Chen, Bohan Chen, Shuyang Gu, Jianmin Bao, Dong Chen, Ji Li, Yuhui Yuan
This task essentially requires generating coherent and consistent visual content within the confines of a font-shaped canvas, as opposed to a traditional rectangular canvas.
no code implementations • 6 Jun 2024 • Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Ji Li, Liang Zheng
To address this limitation, we propose Step-aware Preference Optimization (SPO), a novel post-training approach that independently evaluates and adjusts the denoising performance at each step, using a step-aware preference model and a step-wise resampler to ensure accurate step-aware supervision.
1 code implementation • 21 Mar 2024 • Yueru Jia, Yuhui Yuan, Aosong Cheng, Chuke Wang, Ji Li, Huizhu Jia, Shanghang Zhang
Second, we propose an instruction-guided latent fusion that pastes the multi-layered latent representations onto a canvas latent.
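The "paste multi-layered latents onto a canvas latent" step can be pictured as mask-weighted compositing in latent space. The sketch below is a simplified assumption of that operation (function and variable names are hypothetical) and ignores the instruction-guided weighting the paper describes.

```python
import torch

def fuse_layers_onto_canvas(canvas: torch.Tensor,
                            layer_latents: list,
                            layer_masks: list) -> torch.Tensor:
    """Composite per-layer latents onto a canvas latent, back to front.

    canvas:        (B, C, H, W) latent of the background canvas
    layer_latents: list of (B, C, H, W) latents, one per layer
    layer_masks:   list of (B, 1, H, W) soft masks in [0, 1] at latent resolution
    """
    fused = canvas.clone()
    for latent, mask in zip(layer_latents, layer_masks):
        # Alpha-style blend: masked regions take the layer latent,
        # the rest keeps the current canvas content.
        fused = mask * latent + (1.0 - mask) * fused
    return fused

if __name__ == "__main__":
    B, C, H, W = 1, 4, 64, 64               # SD-style latent shape (assumption)
    canvas = torch.randn(B, C, H, W)
    layers = [torch.randn(B, C, H, W) for _ in range(2)]
    masks = [torch.rand(B, 1, H, W) for _ in range(2)]
    print(fuse_layers_onto_canvas(canvas, layers, masks).shape)
```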
no code implementations • 14 Mar 2024 • Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, Yuhui Yuan
Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies.
no code implementations • 15 Dec 2023 • Nan Huang, Ting Zhang, Yuhui Yuan, Dong Chen, Shanghang Zhang
In this paper, we present a novel two-stage approach that fully utilizes the information provided by the reference image to establish a customized knowledge prior for image-to-3D generation.
no code implementations • 30 Nov 2023 • Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, Chong Luo, Yueyi Zhang, Zhiwei Xiong
Second, it preserves the high-fidelity generation ability of the pre-trained image diffusion models by making only minimal network modifications.
no code implementations • CVPR 2024 • Yanhui Wang, Jianmin Bao, Wenming Weng, Ruoyu Feng, Dacheng Yin, Tao Yang, Jingxu Zhang, Qi Dai, Zhiyuan Zhao, Chunyu Wang, Kai Qiu, Yuhui Yuan, Chuanxin Tang, Xiaoyan Sun, Chong Luo, Baining Guo
We present MicroCinema, a straightforward yet effective framework for high-quality and coherent text-to-video generation.
no code implementations • 28 Nov 2023 • Peidong Jia, Chenxuan Li, Yuhui Yuan, Zeyu Liu, Yichao Shen, Bohan Chen, Xingru Chen, Yinglin Zheng, Dong Chen, Ji Li, Xiaodong Xie, Shanghang Zhang, Baining Guo
Our COLE system comprises multiple fine-tuned Large Language Models (LLMs), Large Multimodal Models (LMMs), and Diffusion Models (DMs), each specifically tailored for design-aware layer-wise captioning, layout planning, reasoning, and the task of generating images and text.
no code implementations • CVPR 2024 • Ruoyu Feng, Wenming Weng, Yanhui Wang, Yuhui Yuan, Jianmin Bao, Chong Luo, Zhibo Chen, Baining Guo
The versatility of our framework is demonstrated through a diverse range of choices in both structure representations and personalized T2I models, as well as the option to provide the edited key frame.
1 code implementation • ICCV 2023 • Xin Lai, Yuhui Yuan, Ruihang Chu, Yukang Chen, Han Hu, Jiaya Jia
Therefore, we abandon the mask attention design and resort to an auxiliary center regression task instead.
1 code implementation • ICCV 2023 • Frederic Z. Zhang, Yuhui Yuan, Dylan Campbell, Zhuoyao Zhong, Stephen Gould
Recently, the DETR framework has emerged as the dominant approach for human-object interaction (HOI) research.
Ranked #2 on Human-Object Interaction Detection on HICO-DET
1 code implementation • 8 Aug 2023 • Yichao Shen, Zigang Geng, Yuhui Yuan, Yutong Lin, Ze Liu, Chunyu Wang, Han Hu, Nanning Zheng, Baining Guo
We introduce a highly performant 3D object detector for point clouds using the DETR framework.
Ranked #2 on 3D Object Detection on ScanNetV2
no code implementations • 7 Aug 2023 • Zhanhao Liang, Yuhui Yuan
In this paper, we study how to build a strong instance segmenter with minimal training time and GPU resources, in contrast to the majority of current approaches, which pursue more accurate instance segmenters by building more advanced frameworks at the cost of longer training time and higher GPU requirements.
Ranked #3 on Instance Segmentation on COCO minival (using extra training data)
1 code implementation • 3 Aug 2023 • Yutong Lin, Yuhui Yuan, Zheng Zhang, Chen Li, Nanning Zheng, Han Hu
This paper presents an improved DETR detector that maintains a "plain" nature: using a single-scale feature map and global cross-attention calculations without specific locality constraints, in contrast to previous leading DETR-based detectors that reintroduce architectural inductive biases of multi-scale and locality into the decoder.
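The "plain" design can be reduced to a toy decoder step: object queries perform global cross-attention over a single-scale feature map flattened into tokens, with no multi-scale features and no locality-restricted attention. This is a generic sketch of that idea, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PlainCrossAttentionStep(nn.Module):
    """One decoder step with global cross-attention over a single-scale map."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        # queries: (B, num_queries, C); feat: (B, C, H, W) from a single scale.
        tokens = feat.flatten(2).transpose(1, 2)            # (B, H*W, C)
        out, _ = self.cross_attn(queries, tokens, tokens)   # global, no locality mask
        return out

if __name__ == "__main__":
    step = PlainCrossAttentionStep()
    q = torch.randn(2, 100, 256)        # 100 object queries
    f = torch.randn(2, 256, 32, 32)     # single-scale backbone feature
    print(step(q, f).shape)             # torch.Size([2, 100, 256])
```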
no code implementations • 2 Aug 2023 • Yan Ma, Weicong Liang, Bohan Chen, Yiduo Hao, BoJian Hou, Xiangyu Yue, Chao Zhang, Yuhui Yuan
Motivated by the remarkable achievements of DETR-based approaches on COCO object detection and segmentation benchmarks, recent endeavors have been directed towards elevating their performance through self-supervised pre-training of Transformers while preserving a frozen backbone.
2 code implementations • CVPR 2024 • Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia
In this work, we propose a new segmentation task -- reasoning segmentation.
no code implementations • ICCV 2023 • Changqi Wang, Haoyu Xie, Yuhui Yuan, Chong Fu, Xiangyu Yue
To improve the robustness of representations, powerful methods introduce a pixel-wise contrastive learning approach in latent space (i.e., representation space) that aggregates the representations to their prototypes in a fully supervised manner.
1 code implementation • 12 Jun 2023 • Tianhe Ren, Shilong Liu, Feng Li, Hao Zhang, Ailing Zeng, Jie Yang, Xingyu Liao, Ding Jia, Hongyang Li, He Cao, Jianan Wang, Zhaoyang Zeng, Xianbiao Qi, Yuhui Yuan, Jianwei Yang, Lei Zhang
To address this issue, we develop a unified, highly modular, and lightweight codebase called detrex, which supports a majority of the mainstream DETR-based instance recognition algorithms, covering various fundamental tasks, including object detection, segmentation, and pose estimation.
1 code implementation • NeurIPS 2023 • Yukang Yang, Dongnan Gui, Yuhui Yuan, Weicong Liang, Haisong Ding, Han Hu, Kai Chen
We evaluate the effectiveness of our approach by measuring OCR-based metrics, CLIP score, and FID of the generated visual text.
no code implementations • 21 Nov 2022 • Zhihang Zhong, Mingxi Cheng, Zhirong Wu, Yuhui Yuan, Yinqiang Zheng, Ji Li, Han Hu, Stephen Lin, Yoichi Sato, Imari Sato
Image cropping has progressed tremendously under the data-driven paradigm.
4 code implementations • 3 Oct 2022 • Weicong Liang, Yuhui Yuan, Henghui Ding, Xiao Luo, WeiHong Lin, Ding Jia, Zheng Zhang, Chao Zhang, Han Hu
Vision transformers have recently achieved competitive results across various vision tasks but still suffer from heavy computation costs when processing a large number of tokens.
8 code implementations • CVPR 2023 • Ding Jia, Yuhui Yuan, Haodi He, Xiaopei Wu, Haojun Yu, WeiHong Lin, Lei Sun, Chao Zhang, Han Hu
One-to-one set matching is a key design for DETR to establish its end-to-end capability, so that object detection does not require a hand-crafted NMS (non-maximum suppression) to remove duplicate detections.
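The one-to-one set matching that gives DETR its NMS-free behavior is the standard Hungarian assignment between predictions and ground-truth objects. Below is a minimal sketch using `scipy.optimize.linear_sum_assignment`, with a simplified matching cost (negative class probability plus L1 box distance; the weights are placeholder assumptions).

```python
import torch
from scipy.optimize import linear_sum_assignment

@torch.no_grad()
def one_to_one_match(pred_logits, pred_boxes, gt_labels, gt_boxes,
                     cls_weight=1.0, box_weight=5.0):
    """Assign each ground-truth box to exactly one prediction (single image).

    pred_logits: (num_queries, num_classes), pred_boxes: (num_queries, 4)
    gt_labels:   (num_gt,),                  gt_boxes:   (num_gt, 4)
    Returns (pred_indices, gt_indices) of the optimal one-to-one assignment.
    """
    prob = pred_logits.softmax(-1)                        # (Q, num_classes)
    cost_cls = -prob[:, gt_labels]                        # (Q, G): negative prob of gt class
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)     # (Q, G): L1 box distance
    cost = cls_weight * cost_cls + box_weight * cost_box
    pred_idx, gt_idx = linear_sum_assignment(cost.cpu().numpy())
    return pred_idx, gt_idx

if __name__ == "__main__":
    logits, boxes = torch.randn(100, 80), torch.rand(100, 4)
    labels, gts = torch.tensor([3, 17]), torch.rand(2, 4)
    print(one_to_one_match(logits, boxes, labels, gts))
```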
5 code implementations • 5 Apr 2022 • Jiequan Cui, Yuhui Yuan, Zhisheng Zhong, Zhuotao Tian, Han Hu, Stephen Lin, Jiaya Jia
In this paper, we study the problem of class imbalance in semantic segmentation.
Ranked #21 on Semantic Segmentation on ADE20K
2 code implementations • 8 Mar 2022 • Haodi He, Yuhui Yuan, Xiangyu Yue, Han Hu
Given an input image or video, our framework first conducts multi-label classification over the complete label set, then sorts the labels and selects a small subset according to their class confidence scores.
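The "classify first, then shortlist" step can be sketched in a few lines: run a multi-label classifier over the complete label set, sort classes by confidence, and keep only the top-scoring subset for the downstream segmentation head. The threshold, cap, and names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

def select_label_subset(image_feat: torch.Tensor,
                        classifier: nn.Module,
                        max_labels: int = 20,
                        threshold: float = 0.5):
    """Multi-label classification over all classes, then keep a small subset.

    image_feat: (B, C) pooled image feature; classifier maps (B, C) -> (B, num_classes).
    Returns per-image lists of selected class indices, sorted by confidence.
    """
    scores = torch.sigmoid(classifier(image_feat))         # (B, num_classes)
    conf, order = scores.sort(dim=1, descending=True)      # sort the complete label set
    selected = []
    for b in range(scores.size(0)):
        keep = order[b][conf[b] > threshold][:max_labels]  # confident classes only
        selected.append(keep.tolist())
    return selected

if __name__ == "__main__":
    num_classes, dim = 150, 512                            # e.g. ADE20K-sized label set
    clf = nn.Linear(dim, num_classes)
    feats = torch.randn(2, dim)
    print(select_label_subset(feats, clf))
```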
2 code implementations • NeurIPS 2021 • Yuhui Yuan, Rao Fu, Lang Huang, WeiHong Lin, Chao Zhang, Xilin Chen, Jingdong Wang
We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer that produces low-resolution representations and has high memory and computational cost.
Ranked #3 on Pose Estimation on AIC
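A rough sketch of an HRFormer-style block, offered as an illustrative assumption rather than the released implementation: self-attention inside non-overlapping local windows to keep memory low, followed by a feed-forward network that uses a 3x3 depth-wise convolution to exchange information across neighboring windows.

```python
import torch
import torch.nn as nn

class HRFormerStyleBlock(nn.Module):
    """Window attention + depth-wise-conv FFN (illustrative sketch)."""
    def __init__(self, dim: int = 64, num_heads: int = 2, window: int = 7):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * 4, 1),
            nn.Conv2d(dim * 4, dim * 4, 3, padding=1, groups=dim * 4),  # depth-wise 3x3
            nn.GELU(),
            nn.Conv2d(dim * 4, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        w = self.window
        assert H % w == 0 and W % w == 0, "toy sketch: H and W must be multiples of window"
        # Partition the map into (B * num_windows, w*w, C) token groups.
        t = x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        t = t.reshape(-1, w * w, C)
        t = t + self.attn(t, t, t)[0]                       # attention within each window
        # Restore the spatial layout for the convolutional FFN.
        t = t.view(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return t + self.ffn(t)                              # cross-window mixing via DW conv

if __name__ == "__main__":
    block = HRFormerStyleBlock()
    print(block(torch.randn(1, 64, 28, 28)).shape)          # torch.Size([1, 64, 28, 28])
```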
4 code implementations • ICCV 2021 • Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang
Our approach, named conditional DETR, learns a conditional spatial query from the decoder embedding for decoder multi-head cross-attention.
Ranked #108 on Object Detection on COCO minival
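The conditional spatial query can be summarized as: predict a reference point from the decoder embedding, encode it sinusoidally, and modulate that positional encoding with a transformation of the embedding before cross-attention. The sketch below captures this idea under simplifying assumptions (a plain sinusoidal encoder and a single linear transformation); it is not the released implementation.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embed(points: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Encode normalized 2D reference points (B, Q, 2) into (B, Q, dim)."""
    half = dim // 2
    freqs = torch.exp(torch.arange(half // 2, dtype=torch.float32) *
                      (-math.log(10000.0) / (half // 2)))
    x = points[..., 0:1] * freqs
    y = points[..., 1:2] * freqs
    return torch.cat([x.sin(), x.cos(), y.sin(), y.cos()], dim=-1)

class ConditionalSpatialQuery(nn.Module):
    """Form a conditional spatial query for decoder cross-attention (sketch)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.ref_point_head = nn.Linear(dim, 2)   # predict a 2D reference point
        self.transform = nn.Linear(dim, dim)      # roughly the paper's lambda_q

    def forward(self, decoder_embed: torch.Tensor) -> torch.Tensor:
        ref = self.ref_point_head(decoder_embed).sigmoid()     # (B, Q, 2) in [0, 1]
        pos = sinusoidal_embed(ref, decoder_embed.size(-1))    # positional encoding
        return self.transform(decoder_embed) * pos             # element-wise modulation

if __name__ == "__main__":
    head = ConditionalSpatialQuery()
    emb = torch.randn(2, 100, 256)
    print(head(emb).shape)   # (2, 100, 256); combined with the content query in the paper
```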
3 code implementations • CVPR 2021 • Xiaokang Chen, Yuhui Yuan, Gang Zeng, Jingdong Wang
Our approach imposes consistency between two segmentation networks perturbed with different initializations for the same input image.
Ranked #2 on Semi-Supervised Semantic Segmentation on WoodScape
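The cross-network consistency above (two identically structured networks with different initializations) is typically enforced by letting each network's hard pseudo-labels supervise the other through a standard cross-entropy loss on unlabeled images. A minimal sketch of that loss, assuming per-pixel logits:

```python
import torch
import torch.nn.functional as F

def cross_pseudo_supervision_loss(logits_a: torch.Tensor,
                                  logits_b: torch.Tensor) -> torch.Tensor:
    """Consistency loss between two segmentation networks on the same input.

    logits_a, logits_b: (B, num_classes, H, W) predictions of the two networks
    (same image, different initializations). Each network's argmax pseudo-label
    supervises the other; gradients do not flow through the labels.
    """
    pseudo_a = logits_a.argmax(dim=1).detach()    # hard labels from network A
    pseudo_b = logits_b.argmax(dim=1).detach()    # hard labels from network B
    loss_a = F.cross_entropy(logits_a, pseudo_b)  # A learns from B's labels
    loss_b = F.cross_entropy(logits_b, pseudo_a)  # B learns from A's labels
    return loss_a + loss_b

if __name__ == "__main__":
    a = torch.randn(2, 19, 64, 64, requires_grad=True)   # e.g. Cityscapes: 19 classes
    b = torch.randn(2, 19, 64, 64, requires_grad=True)
    print(cross_pseudo_supervision_loss(a, b).item())
```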
4 code implementations • ECCV 2020 • Yuhui Yuan, Jingyi Xie, Xilin Chen, Jingdong Wang
We present a model-agnostic post-processing scheme to improve the boundary quality for the segmentation result that is generated by any existing segmentation model.
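The model-agnostic post-processing can be illustrated as offset-based label refinement: for pixels flagged as boundary, replace the coarse prediction with the prediction at a nearby interior location indicated by an offset. The sketch below assumes the boundary mask and offsets are already given and only shows the replacement step; names are illustrative.

```python
import torch

def refine_boundary_labels(coarse_labels: torch.Tensor,
                           boundary_mask: torch.Tensor,
                           offsets: torch.Tensor) -> torch.Tensor:
    """Replace boundary-pixel labels with labels of the interior pixels they point to.

    coarse_labels: (H, W) integer label map from any segmentation model
    boundary_mask: (H, W) bool, True where the pixel lies near an object boundary
    offsets:       (H, W, 2) integer (dy, dx) pointing from a boundary pixel
                   toward a more reliable interior pixel
    """
    H, W = coarse_labels.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    src_y = (ys + offsets[..., 0]).clamp(0, H - 1)   # interior row each pixel points to
    src_x = (xs + offsets[..., 1]).clamp(0, W - 1)   # interior column
    refined = coarse_labels.clone()
    refined[boundary_mask] = coarse_labels[src_y, src_x][boundary_mask]
    return refined

if __name__ == "__main__":
    labels = torch.randint(0, 19, (64, 64))
    mask = torch.zeros(64, 64, dtype=torch.bool); mask[30:34, :] = True
    offs = torch.zeros(64, 64, 2, dtype=torch.long); offs[..., 0] = 3  # point 3 rows down
    print(refine_boundary_labels(labels, mask, offs).shape)
```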
1 code implementation • ICCV 2019 • Jianyuan Guo, Yuhui Yuan, Lang Huang, Chao Zhang, Jinge Yao, Kai Han
On the other hand, there still exist many useful contextual cues that do not fall into the scope of predefined human parts or attributes.
Ranked #59 on Person Re-Identification on DukeMTMC-reID
11 code implementations • ECCV 2020 • Yuhui Yuan, Xiaokang Chen, Xilin Chen, Jingdong Wang
We empirically demonstrate that the proposed approach achieves competitive performance on various challenging semantic segmentation benchmarks: Cityscapes, ADE20K, LIP, PASCAL-Context, and COCO-Stuff.
Ranked #5 on Semantic Segmentation on LIP val
6 code implementations • 29 Jul 2019 • Lang Huang, Yuhui Yuan, Jianyuan Guo, Chao Zhang, Xilin Chen, Jingdong Wang
There are two successive attention modules, each estimating a sparse affinity matrix.
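The two sparse-affinity modules can be understood as an interlacing scheme: first attend among far-apart positions that share the same slot within their local block (long-range step), then attend within each local block (short-range step). The sketch below implements this idea over flattened 1D tokens as an illustrative assumption, not the released code.

```python
import torch
import torch.nn as nn

class InterlacedSparseAttention(nn.Module):
    """Two successive attentions, each with a sparse (block-wise) affinity."""
    def __init__(self, dim: int = 64, num_heads: int = 4, block: int = 16):
        super().__init__()
        self.block = block
        self.long_range = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.short_range = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape                      # N must be a multiple of `block` here
        nb = N // self.block
        # Long-range step: group tokens that occupy the same slot in each block.
        t = x.view(B, nb, self.block, C).transpose(1, 2).reshape(B * self.block, nb, C)
        t = self.long_range(t, t, t)[0]
        t = t.view(B, self.block, nb, C).transpose(1, 2).reshape(B, N, C)
        # Short-range step: group tokens within each local block.
        s = t.view(B * nb, self.block, C)
        s = self.short_range(s, s, s)[0]
        return s.view(B, N, C)

if __name__ == "__main__":
    attn = InterlacedSparseAttention()
    x = torch.randn(2, 256, 64)     # 256 tokens, block size 16
    print(attn(x).shape)            # torch.Size([2, 256, 64])
```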
8 code implementations • 4 Sep 2018 • Yuhui Yuan, Lang Huang, Jianyuan Guo, Chao Zhang, Xilin Chen, Jingdong Wang
To capture richer context information, we further combine our interlaced sparse self-attention scheme with conventional multi-scale context schemes, including pyramid pooling (Zhao et al., 2017) and atrous spatial pyramid pooling (Chen et al., 2018).
Ranked #9 on Semantic Segmentation on Trans10K
no code implementations • ICLR 2018 • Yuhui Yuan, Kuiyuan Yang, Chao Zhang
Thus, we propose feature incay to additionally regularize representation learning, favoring feature vectors with large norms when the samples can be correctly classified.
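As a rough illustration of "favoring large-norm features for correctly classified samples", one can add a penalty that shrinks as the feature norm grows and apply it only to correctly classified examples. The specific regularizer below (a reciprocal of the squared norm) is an assumption for illustration, not necessarily the paper's formulation.

```python
import torch
import torch.nn.functional as F

def feature_incay_penalty(features: torch.Tensor,
                          logits: torch.Tensor,
                          labels: torch.Tensor,
                          eps: float = 1e-6) -> torch.Tensor:
    """Encourage larger feature norms, but only for correctly classified samples.

    features: (B, D) penultimate-layer feature vectors
    logits:   (B, num_classes) classifier outputs
    labels:   (B,) ground-truth class indices
    """
    correct = logits.argmax(dim=1).eq(labels)     # samples the classifier already gets right
    norms_sq = features.pow(2).sum(dim=1)         # squared L2 norms
    penalty = 1.0 / (norms_sq + eps)              # shrinks as the norm grows
    return penalty[correct].sum() / max(correct.sum().item(), 1)

if __name__ == "__main__":
    feats, logits = torch.randn(8, 128, requires_grad=True), torch.randn(8, 10)
    labels = torch.randint(0, 10, (8,))
    loss = F.cross_entropy(logits, labels) + 0.01 * feature_incay_penalty(feats, logits, labels)
    print(loss.item())
```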
1 code implementation • ICCV 2017 • Yuhui Yuan, Kuiyuan Yang, Chao Zhang
This motivates us to ensemble a set of models of different complexities in a cascaded manner and mine hard examples adaptively: a sample is judged by a series of models of increasing complexity and only updates the models that consider it a hard case.
Ranked #14 on Image Retrieval on SOP
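The cascaded hard-example scheme can be sketched as: pass each sample through models of increasing capacity, and let a model be updated only when it finds the sample hard, i.e. its loss exceeds a threshold; easy samples stop flowing down the cascade. The thresholds, loss, and batch-level treatment below are placeholder assumptions for illustration.

```python
import torch
import torch.nn as nn

def cascaded_hard_mining_step(models, optimizers, x, y, hard_thresholds=(1.5, 1.0, 0.5)):
    """One training step over a cascade of models of increasing complexity.

    models/optimizers: lists ordered from simplest to most complex.
    The batch keeps flowing to deeper models only while it is judged hard
    (loss above the current model's threshold); a model is updated only on
    batches it considers hard. The batch stands in for a single sample here.
    """
    criterion = nn.CrossEntropyLoss()
    for model, opt, thr in zip(models, optimizers, hard_thresholds):
        loss = criterion(model(x), y)
        if loss.item() <= thr:       # easy for this model: no update, stop the cascade
            break
        opt.zero_grad()              # hard case: this model learns from it,
        loss.backward()              # and the batch continues to the next model
        opt.step()

if __name__ == "__main__":
    dims = (32, 64, 128)             # increasing capacity along the cascade
    models = [nn.Sequential(nn.Linear(16, d), nn.ReLU(), nn.Linear(d, 10)) for d in dims]
    opts = [torch.optim.SGD(m.parameters(), lr=0.1) for m in models]
    cascaded_hard_mining_step(models, opts, torch.randn(4, 16), torch.randint(0, 10, (4,)))
```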