1 code implementation • 2 May 2024 • Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, Jose M. Alvarez
We further propose OmniDrive-nuScenes, a new visual question-answering dataset challenging the true 3D situational awareness of a model with comprehensive visual question-answering (VQA) tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making and planning.
no code implementations • 1 Apr 2024 • Shuaiyi Huang, De-An Huang, Zhiding Yu, Shiyi Lan, Subhashree Radhakrishnan, Jose M. Alvarez, Abhinav Shrivastava, Anima Anandkumar
Video instance segmentation (VIS) is a challenging vision task that aims to detect, segment, and track objects in videos.
no code implementations • 31 Jan 2024 • Shijia Liao, Shiyi Lan, Arun George Zachariah
The advent of Large Models marks a new era in machine learning, significantly outperforming smaller models by leveraging vast datasets to capture and synthesize complex patterns.
Ranked #1 on Speech Synthesis on LibriTTS
1 code implementation • ICCV 2023 • Bingyin Zhao, Zhiding Yu, Shiyi Lan, Yutao Cheng, Anima Anandkumar, Yingjie Lao, Jose M. Alvarez
With the proposed STL framework, our best model based on FAN-L-Hybrid (77. 3M parameters) achieves 84. 8% Top-1 accuracy and 42. 1% mCE on ImageNet-1K and ImageNet-C, and sets a new state-of-the-art for ImageNet-A (46. 1%) and ImageNet-R (56. 6%) without using extra data, outperforming the original FAN counterpart by significant margins.
Ranked #16 on Domain Generalization on ImageNet-C
1 code implementation • 21 Dec 2023 • Junfei Xiao, Ziqi Zhou, Wenxuan Li, Shiyi Lan, Jieru Mei, Zhiding Yu, Alan Yuille, Yuyin Zhou, Cihang Xie
Instead of relying solely on category-specific annotations, ProLab uses descriptive properties grounded in common sense knowledge for supervising segmentation models.
1 code implementation • 5 Dec 2023 • Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, Jose M. Alvarez
We initially observed that the nuScenes dataset, characterized by relatively simple driving scenarios, leads to an under-utilization of perception information in end-to-end models incorporating ego status, such as the ego vehicle's velocity.
no code implementations • 4 Dec 2023 • Zhenxin Li, Shiyi Lan, Jose M. Alvarez, Zuxuan Wu
Recently, the rise of query-based Transformer decoders is reshaping camera-based 3D object detection.
1 code implementation • 30 Nov 2023 • Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu
With this in mind, we propose a simple yet effective approach to optimize VLMs in fine-grained understanding, achieving significant improvements on SPEC without compromising the zero-shot performance.
1 code implementation • 24 Nov 2023 • Lingchen Meng, Shiyi Lan, Hengduo Li, Jose M. Alvarez, Zuxuan Wu, Yu-Gang Jiang
In-context segmentation aims at segmenting novel images using a few labeled example images, termed as "in-context examples", exploring content similarities between examples and the target.
1 code implementation • 30 Oct 2023 • Ali Hatamizadeh, Michael Ranzinger, Shiyi Lan, Jose M. Alvarez, Sanja Fidler, Jan Kautz
Inspired by this trend, we propose a new class of computer vision models, dubbed Vision Retention Networks (ViR), with dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
1 code implementation • 8 Aug 2023 • Yilun Chen, Zhiding Yu, Yukang Chen, Shiyi Lan, Animashree Anandkumar, Jiaya Jia, Jose Alvarez
For 3D object detection, we instantiate this method as FocalFormer3D, a simple yet effective detector that excels at excavating difficult objects and improving prediction recall.
Ranked #8 on 3D Object Detection on nuScenes
1 code implementation • 4 Jul 2023 • Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, Jose M. Alvarez
This technical report summarizes the winning solution for the 3D Occupancy Prediction Challenge, which is held in conjunction with the CVPR 2023 Workshop on End-to-End Autonomous Driving and CVPR 23 Workshop on Vision-Centric Autonomous Driving Workshop.
Ranked #1 on Prediction Of Occupancy Grid Maps on Occ3D-nuScenes
no code implementations • 9 Feb 2023 • Zhuolin Yang, Wei Ping, Zihan Liu, Vijay Korthikanti, Weili Nie, De-An Huang, Linxi Fan, Zhiding Yu, Shiyi Lan, Bo Li, Ming-Yu Liu, Yuke Zhu, Mohammad Shoeybi, Bryan Catanzaro, Chaowei Xiao, Anima Anandkumar
Augmenting pretrained language models (LMs) with a vision encoder (e. g., Flamingo) has obtained the state-of-the-art results in image-to-text generation.
no code implementations • CVPR 2023 • Shiyi Lan, Xitong Yang, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez, Anima Anandkumar
We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations.
1 code implementation • ICCV 2023 • Yilun Chen, Zhiding Yu, Yukang Chen, Shiyi Lan, Anima Anandkumar, Jiaya Jia, Jose M. Alvarez
For 3D object detection, we instantiate this method as FocalFormer3D, a simple yet effective detector that excels at excavating difficult objects and improving prediction recall.
1 code implementation • 23 Oct 2022 • Junfei Xiao, Zhichao Xu, Shiyi Lan, Zhiding Yu, Alan Yuille, Anima Anandkumar
The model is trained on a composite dataset consisting of images from 9 datasets (ADE20K, Cityscapes, Mapillary Vistas, ScanNet, VIPER, WildDash 2, IDD, BDD, and COCO) with a simple dataset balancing strategy.
no code implementations • CVPR 2022 • Lingchen Meng, Hengduo Li, Bor-Chun Chen, Shiyi Lan, Zuxuan Wu, Yu-Gang Jiang, Ser-Nam Lim
To this end, we introduce AdaViT, an adaptive computation framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use throughout the backbone on a per-input basis, aiming to improve inference efficiency of vision transformers with a minimal drop of accuracy for image recognition.
3 code implementations • ICCV 2021 • Shiyi Lan, Zhiding Yu, Christopher Choy, Subhashree Radhakrishnan, Guilin Liu, Yuke Zhu, Larry S. Davis, Anima Anandkumar
We introduce DiscoBox, a novel framework that jointly learns instance segmentation and semantic correspondence using bounding box supervision.
1 code implementation • 24 Apr 2021 • Tianrui Guan, Jun Wang, Shiyi Lan, Rohan Chandra, Zuxuan Wu, Larry Davis, Dinesh Manocha
We present a novel architecture for 3D object detection, M3DeTR, which combines different point cloud representations (raw, voxels, bird-eye view) with different feature scales based on multi-scale feature pyramids.
Ranked #1 on 3D Object Detection on KITTI Cars Hard val
no code implementations • ECCV 2020 • Jun Wang, Shiyi Lan, Mingfei Gao, Larry S. Davis
Results show that our framework achieves the state-of-the-art performance with 31 FPS and improves our baseline significantly by 9. 0% mAP on the nuScenes test set.
Ranked #328 on 3D Object Detection on nuScenes
1 code implementation • CVPR 2020 • Shiyi Lan, Zhou Ren, Yi Wu, Larry S. Davis, Gang Hua
Object detection is an essential step towards holistic scene understanding.
Ranked #205 on Object Detection on COCO test-dev
2 code implementations • CVPR 2019 • Shiyi Lan, Ruichi Yu, Gang Yu, Larry S. Davis
This encourages the network to preserve the geometric structure in Euclidean space throughout the feature extraction hierarchy.
3 code implementations • CVPR 2017 • Hexiang Hu, Shiyi Lan, Yuning Jiang, Zhimin Cao, Fei Sha
Objects appear to scale differently in natural images.