1 code implementation • 13 Mar 2025 • Ju He, Qihang Yu, Qihao Liu, Liang-Chieh Chen
While conventional approaches treat the text modality as a conditioning signal that gradually guides the denoising process from Gaussian noise to the target image modality, we explore a much simpler paradigm: directly evolving between text and image modalities through flow matching.
1 code implementation • 27 Feb 2025 • Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a $k\times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image.
Ranked #1 on
Image Generation
on ImageNet 256x256
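As a rough illustration of the "cell" entity described in the xAR entry above (a $k\times k$ grouping of neighboring patches), the sketch below groups row-major patch indices of a grid into non-overlapping cells. The function name `group_patches_into_cells` and the row-major indexing are assumptions made for illustration only; this is not the paper's implementation.

```python
def group_patches_into_cells(height, width, k):
    """Group row-major patch indices of a height x width grid into k x k cells.

    Hypothetical helper for illustration; assumes non-overlapping cells and
    grid dimensions divisible by k.
    """
    if height % k or width % k:
        raise ValueError("grid dimensions must be divisible by k")
    cells = []
    for top in range(0, height, k):        # top row of each cell
        for left in range(0, width, k):    # left column of each cell
            cell = [
                (top + dy) * width + (left + dx)
                for dy in range(k)
                for dx in range(k)
            ]
            cells.append(cell)
    return cells


# A 4 x 4 patch grid split into 2 x 2 cells yields four cells of four patches.
cells = group_patches_into_cells(4, 4, 2)
```

With `k = 1` each cell reduces to a single patch token, and with `k` equal to the grid size the single cell covers the whole image, which mirrors the range of entities the abstract lists.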
no code implementations • 26 Feb 2025 • Tiezheng Zhang, Qihang Yu, Alan Yuille, Ju He
Designed around Contrastive Components and Logical Constraints, CoCal rethinks existing cluster-based mask transformer architectures used in segmentation. Specifically, CoCal utilizes a set of dictionary components, with each component being explicitly linked to a specific semantic class.
no code implementations • 4 Feb 2025 • Xueqing Deng, Qihang Yu, Ali Athar, Chenglin Yang, Linjie Yang, Xiaojie Jin, Xiaohui Shen, Liang-Chieh Chen
This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.
1 code implementation • 13 Jan 2025 • Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, Liang-Chieh Chen
Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data.
1 code implementation • 19 Dec 2024 • Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
Recently, in image generation, VAR proposes scale-wise autoregressive modeling, which extends the next token prediction to the next scale prediction, preserving the 2D structure of images.
Ranked #21 on
Image Generation
on ImageNet 256x256
1 code implementation • 1 Nov 2024 • Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks.
Ranked #10 on
Image Generation
on ImageNet 256x256
1 code implementation • 24 Sep 2024 • Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen
Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models.
Ranked #7 on
Image Reconstruction
on ImageNet
1 code implementation • 13 Jun 2024 • Qihao Liu, Zhanpeng Zeng, Ju He, Qihang Yu, Xiaohui Shen, Liang-Chieh Chen
This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization.
Ranked #20 on
Image Generation
on ImageNet 256x256
1 code implementation • 11 Jun 2024 • Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen
On the ImageNet 512 x 512 benchmark, TiTok not only outperforms the state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04), but also reduces the image tokens by 64x, leading to a 410x faster generation process.
Ranked #8 on
Image Reconstruction
on ImageNet
no code implementations • 4 Jun 2024 • Inkyu Shin, Qihang Yu, Xiaohui Shen, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen
In the second stage, we leverage the reconstruction ability developed in the first stage to impose the temporal constraints on the video diffusion model.
no code implementations • 22 Apr 2024 • Hongyun Yu, Zhan Qu, Qihang Yu, Jianchuan Chen, Zhonghua Jiang, Zhiwen Chen, Shengyu Zhang, Jimin Xu, Fei Wu, Chengfei Lv, Gang Yu
In this paper, we propose GaussianTalker, a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting.
no code implementations • CVPR 2024 • Xueqing Deng, Qihang Yu, Peng Wang, Xiaohui Shen, Liang-Chieh Chen
By enhancing the annotation quality and expanding the dataset to encompass 383K images with more than 5.18M panoptic masks, we introduce COCONut, the COCO Next Universal segmenTation dataset.
2 code implementations • CVPR 2024 • Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
To this end, we introduce ViTamin, a new vision model tailored for VLMs.
2 code implementations • 30 Nov 2023 • Ju He, Qihang Yu, Inkyu Shin, Xueqing Deng, Alan Yuille, Xiaohui Shen, Liang-Chieh Chen
In this work, we present Axial-VS, a general and simple framework that enhances video segmenters by tracking objects along axial trajectories.
Ranked #2 on
Video Panoptic Segmentation
on VIPSeg
1 code implementation • 14 Nov 2023 • Qihang Yu, Xiaohui Shen, Liang-Chieh Chen
Localizing and recognizing objects in the open-ended physical world poses a long-standing challenge within the domain of machine perception.
3 code implementations • 11 Oct 2023 • Jieneng Chen, Jieru Mei, Xianhang Li, Yongyi Lu, Qihang Yu, Qingyue Wei, Xiangde Luo, Yutong Xie, Ehsan Adeli, Yan Wang, Matthew Lungren, Lei Xing, Le Lu, Alan Yuille, Yuyin Zhou
In this paper, we extend the 2D TransUNet architecture to a 3D network by building upon the state-of-the-art nnU-Net architecture, and fully exploring Transformers' potential in both the encoder and decoder design.
1 code implementation • NeurIPS 2023 • Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen
The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to a larger input resolution than the one used during contrastive image-text pretraining.
Ranked #1 on
Open Vocabulary Semantic Segmentation
on Cityscapes
Open Vocabulary Panoptic Segmentation
Open Vocabulary Semantic Segmentation
1 code implementation • NeurIPS 2023 • Shuyang Sun, Weijun Wang, Qihang Yu, Andrew Howard, Philip Torr, Liang-Chieh Chen
This paper presents a new mechanism to facilitate the training of mask transformers for efficient panoptic segmentation, democratizing its deployment.
1 code implementation • CVPR 2023 • Ju He, Jieneng Chen, Ming-Xian Lin, Qihang Yu, Alan Yuille
Compositor achieves state-of-the-art performance on PartImageNet and Pascal-Part, outperforming previous methods by around 0.9% and 1.3% on PartImageNet and by 0.4% and 1.7% on Pascal-Part in terms of part and object mIoU, and demonstrates better robustness against occlusion by around 4.4% and 7.1% on part and object, respectively.
no code implementations • 10 Apr 2023 • Inkyu Shin, Dahun Kim, Qihang Yu, Jun Xie, Hong-Seok Kim, Bradley Green, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen
The meta architecture of the proposed Video-kMaX consists of two components: a within-clip segmenter (for clip-level segmentation) and a cross-clip associater (for association beyond clips).
1 code implementation • 30 Mar 2023 • Lucas Beyer, Bo Wan, Gagan Madan, Filip Pavetic, Andreas Steiner, Alexander Kolesnikov, André Susano Pinto, Emanuele Bugliarello, Xiao Wang, Qihang Yu, Liang-Chieh Chen, Xiaohua Zhai
A key finding is that a small decoder learned on top of a frozen pretrained encoder works surprisingly well.
no code implementations • ICCV 2023 • Jieneng Chen, Yingda Xia, Jiawen Yao, Ke Yan, Jianpeng Zhang, Le Lu, Fakai Wang, Bo Zhou, Mingyan Qiu, Qihang Yu, Mingze Yuan, Wei Fang, Yuxing Tang, Minfeng Xu, Jian Zhou, Yuqian Zhao, Qifeng Wang, Xianghua Ye, Xiaoli Yin, Yu Shi, Xin Chen, Jingren Zhou, Alan Yuille, Zaiyi Liu, Ling Zhang
Human readers or radiologists routinely perform full-body multi-organ multi-disease detection and diagnosis in clinical practice, while most medical AI systems are built to focus on single organs with a narrow list of a few diseases.
2 code implementations • 4 Oct 2022 • Chenglin Yang, Siyuan Qiao, Qihang Yu, Xiaoding Yuan, Yukun Zhu, Alan Yuille, Hartwig Adam, Liang-Chieh Chen
The tiny-MOAT family is also benchmarked on downstream tasks, serving as a baseline for the community.
Ranked #1 on
Object Detection
on MS COCO
3 code implementations • 8 Jul 2022 • Qihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
However, we observe that most existing transformer-based vision models simply borrow the idea from NLP, neglecting the crucial difference between languages and images, particularly the extremely large sequence length of spatially flattened pixel features.
Ranked #2 on
Panoptic Segmentation
on COCO test-dev
2 code implementations • CVPR 2022 • Qihang Yu, Huiyu Wang, Dahun Kim, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
We propose Clustering Mask Transformer (CMT-DeepLab), a transformer-based framework for panoptic segmentation designed around clustering.
Ranked #6 on
Panoptic Segmentation
on COCO test-dev
no code implementations • CVPR 2022 • Dahun Kim, Jun Xie, Huiyu Wang, Siyuan Qiao, Qihang Yu, Hong-Seok Kim, Hartwig Adam, In So Kweon, Liang-Chieh Chen
We present TubeFormer-DeepLab, the first attempt to tackle multiple core video segmentation tasks in a unified manner.
1 code implementation • 2 Dec 2021 • Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xiaoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, Qihang Yu, Alan Yuille
To help address this problem, we propose PartImageNet, a large, high-quality dataset with part segmentation annotations.
4 code implementations • 17 Jun 2021 • Mark Weber, Huiyu Wang, Siyuan Qiao, Jun Xie, Maxwell D. Collins, Yukun Zhu, Liangzhe Yuan, Dahun Kim, Qihang Yu, Daniel Cremers, Laura Leal-Taixe, Alan L. Yuille, Florian Schroff, Hartwig Adam, Liang-Chieh Chen
DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a state-of-the-art and easy-to-use TensorFlow codebase for general dense pixel prediction problems in computer vision.
1 code implementation • NeurIPS 2021 • Qihang Yu, Yingda Xia, Yutong Bai, Yongyi Lu, Alan Yuille, Wei Shen
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes, with the ability to efficiently model both long-range dependencies and local context.
22 code implementations • 8 Feb 2021 • Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, Yuyin Zhou
Medical image segmentation is an essential prerequisite for developing healthcare systems, especially for disease diagnosis and treatment planning.
Ranked #6 on
Medical Image Segmentation
on ACDC
1 code implementation • CVPR 2021 • Qihang Yu, Jianming Zhang, He Zhang, Yilin Wang, Zhe Lin, Ning Xu, Yutong Bai, Alan Yuille
We propose Mask Guided (MG) Matting, a robust matting framework that takes a general coarse mask as guidance.
no code implementations • 25 Nov 2020 • Yutong Bai, Haoqi Fan, Ishan Misra, Ganesh Venkatesh, Yongyi Lu, Yuyin Zhou, Qihang Yu, Vikas Chandra, Alan Yuille
To this end, we present Temporal-aware Contrastive self-supervised learning (TaCo) as a general paradigm to enhance video CSL.
1 code implementation • ICLR 2021 • Yingwei Li, Qihang Yu, Mingxing Tan, Jieru Mei, Peng Tang, Wei Shen, Alan Yuille, Cihang Xie
To prevent models from exclusively attending to a single cue in representation learning, we augment training data with images with conflicting shape and texture information (e.g., an image of chimpanzee shape but with lemon texture) and, most importantly, provide the corresponding supervision from shape and texture simultaneously.
Ranked #656 on
Image Classification
on ImageNet
2 code implementations • CVPR 2020 • Yingwei Li, Xiaojie Jin, Jieru Mei, Xiaochen Lian, Linjie Yang, Cihang Xie, Qihang Yu, Yuyin Zhou, Song Bai, Alan Yuille
However, embedding NL blocks in mobile neural networks has rarely been explored, mainly due to the following challenges: 1) NL blocks generally have a heavy computation cost, which makes them difficult to apply where computational resources are limited, and 2) it is an open problem to discover an optimal configuration for embedding NL blocks into mobile neural networks.
Ranked #60 on
Neural Architecture Search
on ImageNet
1 code implementation • 28 Mar 2020 • Qihang Yu, Yingwei Li, Jieru Mei, Yuyin Zhou, Alan L. Yuille
3D Convolution Neural Networks (CNNs) have been widely applied to 3D scene understanding, such as video analysis and volumetric image recognition.
no code implementations • 18 Mar 2020 • Yingda Xia, Qihang Yu, Wei Shen, Yuyin Zhou, Elliot K. Fishman, Alan L. Yuille
Pancreatic ductal adenocarcinoma (PDAC) is one of the most lethal cancers among the population.
no code implementations • 19 Feb 2020 • Yixiao Zhang, Xiaosong Wang, Ziyue Xu, Qihang Yu, Alan Yuille, Daguang Xu
In addition, we proposed a new evaluation metric for radiology image reporting with the assistance of the same composed graph.
no code implementations • CVPR 2020 • Qihang Yu, Dong Yang, Holger Roth, Yutong Bai, Yixiao Zhang, Alan L. Yuille, Daguang Xu
3D convolution neural networks (CNNs) have proven very successful in parsing organs or tumours in 3D medical images, but it remains complicated and time-consuming to choose or design proper 3D networks given different task contexts.
no code implementations • 2 Apr 2019 • Qihang Yu, Yingda Xia, Lingxi Xie, Elliot K. Fishman, Alan L. Yuille
With this design, we achieve a higher performance while maintaining a lower inference latency on a few abdominal organs from CT scans, in particular when the organ has a peculiar 3D shape and thus strongly requires contextual information, demonstrating our method's effectiveness and ability in capturing 3D information.
2 code implementations • CVPR 2018 • Qihang Yu, Lingxi Xie, Yan Wang, Yuyin Zhou, Elliot K. Fishman, Alan L. Yuille
The key innovation is a saliency transformation module, which repeatedly converts the segmentation probability map from the previous iteration into spatial weights and applies these weights to the current iteration.
Ranked #1 on
Pancreas Segmentation
on TCIA Pancreas-CT Dataset
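The iterative loop described in the saliency transformation entry above can be sketched schematically as follows. Everything here is an assumption made for illustration: the affine-and-clip `saliency_transform` stands in for the learned module, `toy_segmenter` is a hypothetical placeholder for a real segmentation network, and the uniform starting weights are a simple choice rather than the paper's initialisation.

```python
def saliency_transform(prob_map, gain=1.0, bias=0.0):
    """Turn a probability map into spatial weights.

    Stand-in for a learned module: an affine map clipped to [0, 1] (assumption).
    """
    return [[min(1.0, max(0.0, gain * p + bias)) for p in row] for row in prob_map]


def apply_weights(image, weights):
    """Multiply each pixel by its spatial weight."""
    return [
        [x * w for x, w in zip(img_row, w_row)]
        for img_row, w_row in zip(image, weights)
    ]


def toy_segmenter(image):
    """Hypothetical segmenter: treats clipped intensity as foreground probability."""
    return [[min(1.0, max(0.0, x)) for x in row] for row in image]


def iterative_segmentation(image, segmenter, num_iters=3):
    """Feed each iteration's probability map back as weights for the next one."""
    prob = [[1.0 for _ in row] for row in image]  # uniform weights to start
    for _ in range(num_iters):
        weights = saliency_transform(prob)
        prob = segmenter(apply_weights(image, weights))
    return prob


image = [[0.9, 0.1], [0.8, 0.2]]
result = iterative_segmentation(image, toy_segmenter, num_iters=3)
```

In this toy setting, pixels with low predicted probability are suppressed further at every pass, which illustrates how the previous iteration's output focuses the current one.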