no code implementations • 4 Feb 2025 • Xueqing Deng, Qihang Yu, Ali Athar, Chenglin Yang, Linjie Yang, Xiaojie Jin, Xiaohui Shen, Liang-Chieh Chen
This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.
1 code implementation • 13 Jan 2025 • Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, Liang-Chieh Chen
Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data.
no code implementations • 24 Dec 2024 • Chenglin Yang, Celong Liu, Xueqing Deng, Dongwon Kim, Xing Mei, Xiaohui Shen, Liang-Chieh Chen
We present 1.58-bit FLUX, the first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e., values in {-1, 0, +1}) while maintaining comparable performance for generating 1024 x 1024 images.
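The paper's exact quantization recipe is its own; as a rough illustration of ternary weights, here is a common absmean-style quantizer that maps a float tensor to {-1, 0, +1} plus one per-tensor scale (the function names and `threshold` value are illustrative, not from the paper):

```python
import numpy as np

def ternary_quantize(w, threshold=0.7):
    """Quantize a weight tensor to {-1, 0, +1} with a per-tensor scale.

    Absmean-style sketch: weights whose magnitude exceeds
    threshold * mean(|w|) become +/-1, the rest become 0, and a single
    float scale preserves the overall magnitude.
    """
    scale = np.abs(w).mean()
    q = np.where(np.abs(w) > threshold * scale, np.sign(w), 0.0)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    # Reconstruct an approximate float tensor from the ternary codes.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = ternary_quantize(w)
```

Storing 2-bit-capable codes plus one scale per tensor is what drives the memory savings such ternary schemes report.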
1 code implementation • 19 Dec 2024 • Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
Recently, in image generation, VAR proposes scale-wise autoregressive modeling, which extends the next token prediction to the next scale prediction, preserving the 2D structure of images.
Ranked #15 on Image Generation on ImageNet 256x256
no code implementations • 12 Dec 2024 • Ali Athar, Xueqing Deng, Liang-Chieh Chen
Our benchmark evaluates models on both holistic/high-level understanding and language-guided, pixel-precise segmentation.
1 code implementation • 1 Nov 2024 • Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks.
Ranked #5 on Image Generation on ImageNet 256x256
1 code implementation • 24 Sep 2024 • Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen
Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models.
Ranked #7 on Image Generation on ImageNet 256x256
1 code implementation • 13 Jun 2024 • Qihao Liu, Zhanpeng Zeng, Ju He, Qihang Yu, Xiaohui Shen, Liang-Chieh Chen
This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization.
Ranked #14 on Image Generation on ImageNet 256x256
1 code implementation • 11 Jun 2024 • Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen
On the ImageNet 512 x 512 benchmark, TiTok not only outperforms the state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04), but also reduces the number of image tokens by 64x, leading to a 410x faster generation process.
Ranked #8 on Image Reconstruction on ImageNet
no code implementations • 4 Jun 2024 • Inkyu Shin, Qihang Yu, Xiaohui Shen, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen
In the second stage, we leverage the reconstruction ability developed in the first stage to impose the temporal constraints on the video diffusion model.
no code implementations • CVPR 2024 • Xueqing Deng, Qihang Yu, Peng Wang, Xiaohui Shen, Liang-Chieh Chen
By enhancing the annotation quality and expanding the dataset to encompass 383K images with more than 5.18M panoptic masks, we introduce COCONut, the COCO Next Universal segmenTation dataset.
2 code implementations • CVPR 2024 • Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
To this end, we introduce ViTamin, a new vision model tailored for VLMs.
no code implementations • 5 Jan 2024 • Jieru Mei, Liang-Chieh Chen, Alan Yuille, Cihang Xie
In this work, we introduce SPFormer, a novel Vision Transformer enhanced by superpixel representation.
1 code implementation • 11 Dec 2023 • Abdullah Rashwan, Jiageng Zhang, Ali Taalimi, Fan Yang, Xingyi Zhou, Chaochao Yan, Liang-Chieh Chen, Yeqing Li
With a ResNet50 backbone, our MaskConver achieves 53.6% PQ on the COCO panoptic val set, outperforming the modern convolution-based model Panoptic FCN by 9.3%, as well as transformer-based models such as Mask2Former (+1.7% PQ) and kMaX-DeepLab (+0.6% PQ).
Ranked #8 on Panoptic Segmentation on COCO test-dev
2 code implementations • 30 Nov 2023 • Ju He, Qihang Yu, Inkyu Shin, Xueqing Deng, Alan Yuille, Xiaohui Shen, Liang-Chieh Chen
In this work, we present Axial-VS, a general and simple framework that enhances video segmenters by tracking objects along axial trajectories.
Ranked #2 on Video Panoptic Segmentation on VIPSeg
1 code implementation • 14 Nov 2023 • Qihang Yu, Xiaohui Shen, Liang-Chieh Chen
Localizing and recognizing objects in the open-ended physical world poses a long-standing challenge within the domain of machine perception.
1 code implementation • 9 Nov 2023 • Xuan Yang, Liangzhe Yuan, Kimberly Wilber, Astuti Sharma, Xiuye Gu, Siyuan Qiao, Stephanie Debats, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Liang-Chieh Chen
Despite this shift, methods based on the per-pixel prediction paradigm still dominate the benchmarks on the other dense prediction tasks that require continuous outputs, such as depth estimation and surface normal prediction.
Ranked #2 on Surface Normals Estimation on NYU Depth v2
no code implementations • 28 Sep 2023 • Alex Zihao Zhu, Jieru Mei, Siyuan Qiao, Hang Yan, Yukun Zhu, Liang-Chieh Chen, Henrik Kretzschmar
Finally, we directly project the superpixel class predictions back into the pixel space using the associations between the superpixels and the image pixel features.
1 code implementation • NeurIPS 2023 • Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen
The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to a larger input resolution than the one used during contrastive image-text pretraining.
Ranked #1 on Open Vocabulary Semantic Segmentation on Cityscapes
1 code implementation • NeurIPS 2023 • Shuyang Sun, Weijun Wang, Qihang Yu, Andrew Howard, Philip Torr, Liang-Chieh Chen
This paper presents a new mechanism to facilitate the training of mask transformers for efficient panoptic segmentation, democratizing its deployment.
no code implementations • 10 Apr 2023 • Inkyu Shin, Dahun Kim, Qihang Yu, Jun Xie, Hong-Seok Kim, Bradley Green, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen
The meta architecture of the proposed Video-kMaX consists of two components: a within-clip segmenter (for clip-level segmentation) and a cross-clip associater (for association beyond clips).
1 code implementation • 30 Mar 2023 • Lucas Beyer, Bo Wan, Gagan Madan, Filip Pavetic, Andreas Steiner, Alexander Kolesnikov, André Susano Pinto, Emanuele Bugliarello, Xiao Wang, Qihang Yu, Liang-Chieh Chen, Xiaohua Zhai
A key finding is that a small decoder learned on top of a frozen pretrained encoder works surprisingly well.
2 code implementations • 4 Oct 2022 • Chenglin Yang, Siyuan Qiao, Qihang Yu, Xiaoding Yuan, Yukun Zhu, Alan Yuille, Hartwig Adam, Liang-Chieh Chen
The tiny-MOAT family is also benchmarked on downstream tasks, serving as a baseline for the community.
Ranked #1 on Object Detection on MS COCO
3 code implementations • 8 Jul 2022 • Qihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
However, we observe that most existing transformer-based vision models simply borrow the idea from NLP, neglecting the crucial difference between languages and images, particularly the extremely large sequence length of spatially flattened pixel features.
Ranked #2 on Panoptic Segmentation on COCO test-dev
2 code implementations • CVPR 2022 • Qihang Yu, Huiyu Wang, Dahun Kim, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
We propose Clustering Mask Transformer (CMT-DeepLab), a transformer-based framework for panoptic segmentation designed around clustering.
Ranked #6 on Panoptic Segmentation on COCO test-dev
1 code implementation • 15 Jun 2022 • Jieru Mei, Alex Zihao Zhu, Xinchen Yan, Hang Yan, Siyuan Qiao, Yukun Zhu, Liang-Chieh Chen, Henrik Kretzschmar, Dragomir Anguelov
We therefore present the Waymo Open Dataset: Panoramic Video Panoptic Segmentation Dataset, a large-scale dataset that offers high-quality panoptic segmentation labels for autonomous driving.
no code implementations • CVPR 2022 • Dahun Kim, Jun Xie, Huiyu Wang, Siyuan Qiao, Qihang Yu, Hong-Seok Kim, Hartwig Adam, In So Kweon, Liang-Chieh Chen
We present TubeFormer-DeepLab, the first attempt to tackle multiple core video segmentation tasks in a unified manner.
4 code implementations • 17 Jun 2021 • Mark Weber, Huiyu Wang, Siyuan Qiao, Jun Xie, Maxwell D. Collins, Yukun Zhu, Liangzhe Yuan, Dahun Kim, Qihang Yu, Daniel Cremers, Laura Leal-Taixe, Alan L. Yuille, Florian Schroff, Hartwig Adam, Liang-Chieh Chen
DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a state-of-the-art and easy-to-use TensorFlow codebase for general dense pixel prediction problems in computer vision.
1 code implementation • 23 Feb 2021 • Mark Weber, Jun Xie, Maxwell Collins, Yukun Zhu, Paul Voigtlaender, Hartwig Adam, Bradley Green, Andreas Geiger, Bastian Leibe, Daniel Cremers, Aljoša Ošep, Laura Leal-Taixé, Liang-Chieh Chen
The task of assigning semantic classes and track identities to every pixel in a video is called video panoptic segmentation.
1 code implementation • CVPR 2021 • Siyuan Qiao, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
We name this joint task as Depth-aware Video Panoptic Segmentation, and propose a new evaluation metric along with two derived datasets for it, which will be made available to the public.
Ranked #1 on Video Panoptic Segmentation on Cityscapes-VPS (using extra training data)
3 code implementations • CVPR 2021 • Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
As a result, MaX-DeepLab shows a significant 7.1% PQ gain in the box-free regime on the challenging COCO dataset, closing the gap between box-based and box-free methods for the first time.
Ranked #12 on Panoptic Segmentation on COCO test-dev
no code implementations • 23 Nov 2020 • Liang-Chieh Chen, Huiyu Wang, Siyuan Qiao
Wide Residual Networks (Wide-ResNets), shallow but wide variants of Residual Networks (ResNets) that stack a small number of residual blocks with large channel sizes, have demonstrated outstanding performance on multiple dense prediction tasks.
Ranked #2 on Panoptic Segmentation on Cityscapes test (using extra training data)
2 code implementations • 23 Oct 2020 • Ting Liu, Jennifer J. Sun, Long Zhao, Jiaping Zhao, Liangzhe Yuan, Yuxiao Wang, Liang-Chieh Chen, Florian Schroff, Hartwig Adam
Recognition of human poses and actions is crucial for autonomous systems to interact smoothly with people.
6 code implementations • CVPR 2021 • Siyuan Qiao, Liang-Chieh Chen, Alan Yuille
In this paper, we explore this mechanism in the backbone design for object detection.
Ranked #4 on Object Detection on AI-TOD
1 code implementation • ECCV 2020 • Liang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon Shlens
We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences and extra images to surpass state-of-the-art performance on core computer vision tasks.
5 code implementations • ECCV 2020 • Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions.
Ranked #4 on Panoptic Segmentation on Cityscapes val (using extra training data)
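Factorizing 2D self-attention into two 1D self-attentions, as described above, can be sketched in a few lines of numpy: one attention pass along the height axis, then one along the width axis (single head, no positional terms, all names illustrative rather than the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_1d(x):
    """Plain self-attention over the sequence axis of a (..., L, C) array."""
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def axial_attention(x):
    """Approximate 2D self-attention on an (H, W, C) feature map with two
    1D passes, cutting the cost from O((HW)^2) to O(HW * (H + W))."""
    x = attend_1d(x.swapaxes(0, 1)).swapaxes(0, 1)  # attend along H, per column
    x = attend_1d(x)                                # attend along W, per row
    return x

feat = np.random.randn(8, 10, 16)
out = axial_attention(feat)
```

The two-pass factorization is what removes the quadratic dependence on the full spatial resolution that the constraint mentioned above refers to.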
2 code implementations • ECCV 2020 • Jennifer J. Sun, Jiaping Zhao, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Ting Liu
Depictions of similar human body configurations can vary with changing viewpoints.
Ranked #1 on Pose Retrieval on MPI-INF-3DHP
9 code implementations • CVPR 2020 • Bowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh Chen
In this work, we introduce Panoptic-DeepLab, a simple, strong, and fast system for panoptic segmentation, aiming to establish a solid baseline for bottom-up methods that can achieve comparable performance of two-stage methods while yielding fast inference speed.
Ranked #6 on Panoptic Segmentation on Cityscapes test (using extra training data)
1 code implementation • ICCV 2019 • Jyh-Jing Hwang, Stella X. Yu, Jianbo Shi, Maxwell D. Collins, Tien-Ju Yang, Xiao Zhang, Liang-Chieh Chen
The proposed SegSort further produces an interpretable result, as each choice of label can be easily understood from the retrieved nearest segments.
Ranked #10 on Unsupervised Semantic Segmentation on PASCAL VOC 2012 val (using extra training data)
2 code implementations • 10 Oct 2019 • Bowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh Chen
The semantic segmentation branch is the same as the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation branch is class-agnostic, involving a simple instance center regression.
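In center-regression designs of this kind, the instance branch predicts, for each pixel, an offset toward its instance's center, and grouping then assigns every pixel to the nearest predicted center. A minimal numpy sketch of that grouping step, with hypothetical shapes and names (not the paper's code):

```python
import numpy as np

def group_pixels(centers, offsets):
    """Assign each pixel to its nearest instance center.

    centers: (K, 2) array of detected center coordinates (y, x).
    offsets: (H, W, 2) predicted offset from each pixel to its center.
    Returns an (H, W) instance-id map with values in [0, K).
    """
    h, w, _ = offsets.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Each pixel votes for the location pixel_coord + predicted_offset.
    votes = np.stack([ys, xs], axis=-1) + offsets                       # (H, W, 2)
    # Distance from every vote to every candidate center.
    d = np.linalg.norm(votes[:, :, None, :] - centers[None, None], axis=-1)
    return d.argmin(axis=-1)                                            # (H, W)

centers = np.array([[2.0, 2.0], [6.0, 6.0]])
offsets = np.zeros((8, 8, 2))  # zero offsets: pure nearest-center assignment
ids = group_pixels(centers, offsets)
```

Because the branch is class-agnostic, the semantic branch's labels are fused with this instance-id map afterwards to produce the final panoptic result.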
no code implementations • ICCV 2019 • Bowen Cheng, Liang-Chieh Chen, Yunchao Wei, Yukun Zhu, Zilong Huang, JinJun Xiong, Thomas Huang, Wen-mei Hwu, Honghui Shi
The multi-scale context module refers to the operations to aggregate feature responses from a large spatial extent, while the single-stage encoder-decoder structure encodes the high-level semantic information in the encoder path and recovers the boundary information in the decoder path.
63 code implementations • ICCV 2019 • Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, Hartwig Adam
We achieve new state of the art results for mobile classification, detection and segmentation.
Ranked #9 on Dichotomous Image Segmentation on DIS-TE1
3 code implementations • CVPR 2019 • Paul Voigtlaender, Yuning Chai, Florian Schroff, Hartwig Adam, Bastian Leibe, Liang-Chieh Chen
Many of the recent successful methods for video object segmentation (VOS) are overly complicated, heavily rely on fine-tuning on the first frame, and/or are slow, and are hence of limited practical use.
Ranked #1 on Semi-Supervised Video Object Segmentation on YouTube
no code implementations • 13 Feb 2019 • Tien-Ju Yang, Maxwell D. Collins, Yukun Zhu, Jyh-Jing Hwang, Ting Liu, Xiao Zhang, Vivienne Sze, George Papandreou, Liang-Chieh Chen
We present a single-shot, bottom-up approach for whole image parsing.
Ranked #32 on Panoptic Segmentation on Cityscapes val
12 code implementations • CVPR 2019 • Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, Li Fei-Fei
Therefore, we propose to search the network level structure in addition to the cell level structure, which forms a hierarchical architecture search space.
Ranked #7 on Semantic Segmentation on PASCAL VOC 2012 val
1 code implementation • NeurIPS 2018 • Liang-Chieh Chen, Maxwell D. Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, Jonathon Shlens
Recent progress has demonstrated that such meta-learning methods may exceed scalable human-invented architectures on image classification tasks.
Ranked #1 on Human Part Segmentation on PASCAL-Person-Part
3 code implementations • ECCV 2018 • George Papandreou, Tyler Zhu, Liang-Chieh Chen, Spyros Gidaris, Jonathan Tompson, Kevin Murphy
We present a box-free bottom-up approach for the tasks of pose estimation and instance segmentation of people in multi-person images using an efficient single-shot model.
Ranked #8 on Multi-Person Pose Estimation on COCO test-dev
78 code implementations • ECCV 2018 • Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam
The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information.
Ranked #1 on Semantic Segmentation on PASCAL VOC 2012 val (mIoU (Syn) metric)
156 code implementations • CVPR 2018 • Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen
In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes.
Ranked #7 on Retinal OCT Disease Classification on OCT2017
no code implementations • CVPR 2018 • Liang-Chieh Chen, Alexander Hermans, George Papandreou, Florian Schroff, Peng Wang, Hartwig Adam
Within each region of interest, MaskLab performs foreground/background segmentation by combining semantic and direction prediction.
Ranked #87 on Instance Segmentation on COCO test-dev (using extra training data)
1 code implementation • 18 Jul 2017 • Zbigniew Wojna, Vittorio Ferrari, Sergio Guadarrama, Nathan Silberman, Liang-Chieh Chen, Alireza Fathi, Jasper Uijlings
Many machine vision applications, such as semantic segmentation and depth prediction, require predictions for every pixel of the input image.
77 code implementations • 17 Jun 2017 • Liang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig Adam
To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates.
Ranked #3 on Semantic Segmentation on PASCAL VOC 2012 test (using extra training data)
47 code implementations • 2 Jun 2016 • Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille
ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales.
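Atrous (dilated) filtering, the building block of ASPP, spaces the kernel taps apart to enlarge the field of view without adding parameters. A toy 1D sketch of parallel atrous branches at several rates, with the branch outputs stacked (the real module operates on 2D feature maps and concatenates channels; names here are illustrative):

```python
import numpy as np

def atrous_conv1d(x, kernel, rate):
    """1D atrous convolution: kernel taps are spaced `rate` apart,
    enlarging the field of view without extra parameters."""
    k = len(kernel)
    span = (k - 1) * rate
    pad = np.pad(x, (span // 2, span - span // 2))
    return np.array([
        sum(kernel[j] * pad[i + j * rate] for j in range(k))
        for i in range(len(x))
    ])

def aspp_1d(x, kernel, rates=(1, 2, 4)):
    """Toy ASPP: apply the same kernel at several rates in parallel and
    stack the branch outputs, capturing context at multiple scales."""
    return np.stack([atrous_conv1d(x, kernel, r) for r in rates])

x = np.arange(10, dtype=float)
out = aspp_1d(x, kernel=[1.0, 1.0, 1.0])
```

Each branch sees the same number of weights but a different effective field of view, which is the multi-scale capture the abstract describes.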
1 code implementation • ICCV 2015 • George Papandreou, Liang-Chieh Chen, Kevin P. Murphy, Alan L. Yuille
Deep convolutional neural networks (DCNNs) trained on a large number of images with strong pixel-level annotations have recently significantly pushed the state-of-art in semantic image segmentation.
no code implementations • 21 Nov 2015 • Fangting Xia, Peng Wang, Liang-Chieh Chen, Alan L. Yuille
To tackle these difficulties, we propose a "Hierarchical Auto-Zoom Net" (HAZN) for object part parsing which adapts to the local scales of objects and parts.
Ranked #8 on Human Part Segmentation on PASCAL-Part
no code implementations • 18 Nov 2015 • Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, Ram Nevatia
ABC-CNN determines an attention map for an image-question pair by convolving the image feature map with configurable convolutional kernels derived from the question's semantics.
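A question-configured kernel of this kind can be sketched as a dynamic convolution: a projection (a learned matrix in the real model, random and hypothetical here) turns the question embedding into filter weights, which are then convolved with the image feature map to yield a normalized attention map:

```python
import numpy as np

def question_to_kernel(q_embed, proj, k=3, c=4):
    """Project a question embedding into a (k, k, c) convolution kernel."""
    return (proj @ q_embed).reshape(k, k, c)

def attention_map(feat, kernel):
    """Convolve an (H, W, C) feature map with a question-derived kernel,
    then softmax over all spatial positions to get an (H, W) map."""
    h, w, _ = feat.shape
    k = kernel.shape[0]
    pad = k // 2
    fp = np.pad(feat, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(fp[i:i + k, j:j + k, :] * kernel)
    e = np.exp(out - out.max())
    return e / e.sum()

rng = np.random.default_rng(0)
q = rng.standard_normal(8)                 # toy question embedding
proj = rng.standard_normal((3 * 3 * 4, 8))  # stand-in for a learned projection
feat = rng.standard_normal((5, 5, 4))
amap = attention_map(feat, question_to_kernel(q, proj))
```

The resulting map weights image regions by their relevance to the question, which is the attention mechanism the abstract describes.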
no code implementations • CVPR 2016 • Liang-Chieh Chen, Jonathan T. Barron, George Papandreou, Kevin Murphy, Alan L. Yuille
Deep convolutional neural networks (CNNs) are the backbone of state-of-art semantic image segmentation systems.
no code implementations • CVPR 2016 • Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, Alan L. Yuille
We adapt a state-of-the-art semantic image segmentation model, which we jointly train with multi-scale input images and the attention model.
3 code implementations • 9 Feb 2015 • George Papandreou, Liang-Chieh Chen, Kevin Murphy, Alan L. Yuille
Deep convolutional neural networks (DCNNs) trained on a large number of images with strong pixel-level annotations have recently significantly pushed the state-of-art in semantic image segmentation.
18 code implementations • 22 Dec 2014 • Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille
This is due to the very invariance properties that make DCNNs good for high level tasks.
Ranked #3 on Scene Segmentation on SUN-RGBD
no code implementations • 9 Jul 2014 • Liang-Chieh Chen, Alexander G. Schwing, Alan L. Yuille, Raquel Urtasun
Towards this goal, we propose a training algorithm that is able to learn structured models jointly with deep features that form the MRF potentials.
1 code implementation • CVPR 2014 • Liang-Chieh Chen, Sanja Fidler, Alan L. Yuille, Raquel Urtasun
Labeling large-scale datasets with very accurate object segmentations is an elaborate task that requires a high degree of quality control and a budget of tens or hundreds of thousands of dollars.
no code implementations • CVPR 2014 • George Papandreou, Liang-Chieh Chen, Alan L. Yuille
As an alternative, we develop a generative model for the raw intensity of image patches and show that it can support image classification performance on par with optimized SIFT-based techniques in a bag-of-visual-words setting.