no code implementations • 29 Jun 2025 • Sophie Zhou, Shu Kong
The finetuned model greatly improves retrieval performance by 12% AP over the baseline, though it unexpectedly results in a lower recognition accuracy (92.7%).
1 code implementation • 28 Jun 2025 • Yuzhu Wang, Manni Duan, Shu Kong
Visual Prompt Tuning (VPT) is a parameter-efficient fine-tuning technique that adapts a pre-trained vision Transformer (ViT) by learning a small set of parameters in the input space, known as prompts.
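As a rough illustration of the VPT idea (a minimal sketch under assumed shapes, not the authors' released code; the CLS token and task head are omitted for brevity):

```python
import torch
import torch.nn as nn

class VisualPromptTuning(nn.Module):
    """Minimal VPT-style sketch: learnable prompt tokens are prepended to the
    patch-token sequence of a frozen, pre-trained ViT; only the prompts (and,
    in practice, a classification head) receive gradients."""
    def __init__(self, vit_blocks, embed_dim=768, num_prompts=10):
        super().__init__()
        self.blocks = vit_blocks                     # frozen Transformer blocks
        for p in self.blocks.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)

    def forward(self, patch_tokens):                 # (B, N, D) from the ViT stem
        prompts = self.prompts.expand(patch_tokens.size(0), -1, -1)
        x = torch.cat([prompts, patch_tokens], dim=1)  # (B, num_prompts + N, D)
        return self.blocks(x)
```

Here `vit_blocks`, `embed_dim`, and `num_prompts` are placeholders; deep-VPT variants insert prompts at every layer rather than only at the input.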
no code implementations • 5 Jun 2025 • Hanxin Wang, Tian Liu, Shu Kong
SRAPF consists of two stages: (1) partial finetuning of the visual encoder using both ID and retrieved data, and (2) adversarial partial finetuning with few-shot ID data.
no code implementations • CVPR 2025 • Qianqian Shen, Yunhan Zhao, Nahyun Kwon, Jeeeun Kim, Yanan Li, Shu Kong
Instance detection (InsDet) aims to localize specific object instances within novel scene imagery based on given visual references.
1 code implementation • 2 Oct 2024 • Hasnat Md Abdullah, Tian Liu, Kangda Wei, Shu Kong, Ruihong Huang
To explore foundation models' capability in localizing unusual activity, we introduce UAL-Bench, a comprehensive benchmark for unusual activity localization featuring three video datasets (UAG-OOPS, UAG-SSBD, and UAG-FunQA) and an instruction-tuning dataset (OOPS-UAG-Instruct) to improve model capabilities.
1 code implementation • 22 Sep 2024 • Anirudh S Chakravarthy, Meghana Reddy Ganesina, Peiyun Hu, Laura Leal-Taixe, Shu Kong, Deva Ramanan, Aljosa Osep
To address this unrealistic assumption, we study LPS in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of thing and stuff classes can appear.
1 code implementation • 22 Jul 2024 • Jia Shi, Gautam Gare, Jinjin Tian, Siqi Chai, Zhiqiu Lin, Arun Vasudevan, Di Feng, Francesco Ferroni, Shu Kong
We assess 75 models using ImageNet as the ID dataset and five significantly shifted OOD variants, uncovering a strong linear correlation between ID LCA distance and OOD top-1 accuracy.
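For intuition, the LCA distance measures how taxonomically severe a prediction error is within a class hierarchy such as WordNet. A toy sketch (not the paper's code, with illustrative class names):

```python
def lca_height(path_a, path_b, tree_depth):
    """Toy LCA 'distance' between two classes in a tree taxonomy: how far above
    the leaves their lowest common ancestor sits. Paths are root-to-leaf lists;
    `tree_depth` is the height of the tree."""
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break                       # first level where the paths diverge
        common += 1
    return tree_depth - common

# e.g. lca_height(['entity', 'animal', 'dog'],
#                 ['entity', 'animal', 'cat'], tree_depth=3) == 1
```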
1 code implementation • CVPR 2025 • Tian Liu, Huixin Zhang, Shubham Parashar, Shu Kong
Second, more surprisingly, we find that simply finetuning a VLM solely on few-shot examples significantly outperforms previous FSR methods, and finetuning on the mix of retrieved and few-shot data yields even better results.
1 code implementation • 25 Apr 2024 • Samia Shafique, Shu Kong, Charless Fowlkes
Moreover, all existing methods match crime-scene shoeprints to clean reference prints, yet our analysis shows matching to more informative tread depth maps yields better retrieval results.
no code implementations • 1 Apr 2024 • Yechi Ma, Shuoquan Wei, Churun Zhang, Wei Hua, Yanan Li, Shu Kong
Our method builds on a key insight that, compared with 3D detectors, a 2D detector is much easier to train and performs significantly better w.r.t. detections on the 2D image plane.
no code implementations • CVPR 2024 • Xiaogang Xu, Shu Kong, Tao Hu, Zhe Liu, Hujun Bao
Pre-trained models with large-scale training data, such as CLIP and Stable Diffusion, have demonstrated remarkable performance in various high-level computer vision tasks such as image understanding and generation from language descriptions.
no code implementations • 29 Jan 2024 • Nahyun Kwon, Qian Lu, Muhammad Hasham Qazi, Joanne Liu, Changhoon Oh, Shu Kong, Jeeeun Kim
In our increasingly diverse society, everyday physical interfaces often present barriers, impacting individuals across various contexts.
no code implementations • CVPR 2024 • Shubham Parashar, Zhiqiu Lin, Tian Liu, Xiangjue Dong, Yanan Li, Deva Ramanan, James Caverlee, Shu Kong
We address this by using large language models (LLMs) to count the number of pretraining texts that contain synonyms of these concepts.
1 code implementation • 22 Dec 2023 • Anish Madan, Neehar Peri, Shu Kong, Deva Ramanan
Concretely, we propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data and fine-tuned on multi-modal (text and visual) K-shot examples per target class.
1 code implementation • 18 Dec 2023 • Yechi Ma, Neehar Peri, Shuoquan Wei, Achal Dave, Wei Hua, Yanan Li, Deva Ramanan, Shu Kong
Contemporary autonomous vehicle (AV) benchmarks have advanced techniques for training 3D detectors, particularly on large-scale multi-modal (LiDAR + RGB) data.
1 code implementation • CVPR 2024 • Yunhan Zhao, Haoyu Ma, Shu Kong, Charless Fowlkes
We explore this problem by first introducing a new benchmark dataset, consisting of RGB and depth videos, per-frame camera pose, and instance-level annotations in both 2D camera and 3D world coordinates.
1 code implementation • CVPR 2024 • Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over the emphasis of image contents.
1 code implementation • 30 Oct 2023 • Qianqian Shen, Yunhan Zhao, Nahyun Kwon, Jeeeun Kim, Yanan Li, Shu Kong
Instance detection (InsDet) is a long-standing problem in robotics and computer vision, aiming to detect object instances (predefined by some visual examples) in a cluttered scene.
no code implementations • 15 Oct 2023 • Shubham Parashar, Zhiqiu Lin, Yanan Li, Shu Kong
We find that common names are more likely to be included in CLIP's training set, and prompting them achieves 2-5 times higher accuracy on benchmarking datasets of fine-grained species recognition.
1 code implementation • NeurIPS 2023 • Meng Wei, Xiaoyu Yue, Wenwei Zhang, Shu Kong, Xihui Liu, Jiangmiao Pang
Secondly, part segmentation introduces an open granularity challenge due to the diverse and often ambiguous definitions of parts in the open world.
Open-Vocabulary Semantic Segmentation
1 code implementation • 26 May 2023 • Yuzhu Wang, Lechao Cheng, Manni Duan, Yongheng Wang, Zunlei Feng, Shu Kong
Finally, we propose a rather simple loss term (dubbed ND loss) to simultaneously (1) encourage the student to produce large-norm features, and (2) align the direction of student features and teacher class-means (see the sketch below).
Ranked #1 on Knowledge Distillation on COCO 2017 val
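A loose sketch of such a norm-and-direction loss, assuming precomputed teacher class-mean features; the margin value and exact formulation here are illustrative, not the paper's:

```python
import torch
import torch.nn.functional as F

def nd_loss(student_feats, teacher_class_means, labels, norm_margin=10.0):
    """Hypothetical ND-loss sketch: (1) push up the norm of student features,
    (2) align their direction with the teacher's per-class mean feature."""
    norms = student_feats.norm(dim=1)
    norm_term = F.relu(norm_margin - norms).mean()       # encourage large norms
    target_dirs = F.normalize(teacher_class_means[labels], dim=1)
    student_dirs = F.normalize(student_feats, dim=1)
    dir_term = (1 - (student_dirs * target_dirs).sum(dim=1)).mean()  # cosine alignment
    return norm_term + dir_term
```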
no code implementations • 25 Nov 2022 • Shubham Gupta, Jeet Kanjani, Mengtian Li, Francesco Ferroni, James Hays, Deva Ramanan, Shu Kong
We focus on the task of far-field 3D detection (Far3Det) of objects beyond a certain distance from an observer, e.g., >50m.
1 code implementation • 16 Nov 2022 • Neehar Peri, Achal Dave, Deva Ramanan, Shu Kong
Moreover, semantic classes are often organized within a hierarchy, e.g., tail classes such as child and construction-worker are arguably subclasses of pedestrian.
no code implementations • 10 Oct 2022 • Zhiqiu Lin, Deepak Pathak, Yu-Xiong Wang, Deva Ramanan, Shu Kong
LECO requires learning classifiers in distinct time periods (TPs); each TP introduces a new ontology of "fine" labels that refines old ontologies of "coarse" labels (e.g., dog breeds that refine the previous "dog").
1 code implementation • 4 May 2022 • Samia Shafique, Bailey Kong, Shu Kong, Charless C. Fowlkes
We develop a method termed ShoeRinsics that learns to predict depth by leveraging a mix of fully supervised synthetic data and unsupervised retail image data.
2 code implementations • CVPR 2022 • Shaden Alshammari, Yu-Xiong Wang, Deva Ramanan, Shu Kong
In contrast, weight decay penalizes larger weights more heavily and so learns small balanced weights; the MaxNorm constraint encourages growing small weights within a norm ball but caps all the weights by the radius (see the sketch below).
Ranked #9 on Long-tail Learning on CIFAR-100-LT (ρ=10)
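A minimal sketch of the MaxNorm projection step, assuming a per-class weight matrix of shape (num_classes, dim); the radius is a hyperparameter:

```python
import torch

@torch.no_grad()
def maxnorm_project(classifier_weight, radius=1.0):
    """Cap each per-class weight vector's L2 norm at `radius`, leaving smaller
    weights free to grow inside the norm ball."""
    norms = classifier_weight.norm(dim=1, keepdim=True)
    scale = (radius / norms).clamp(max=1.0)   # shrink only vectors above the radius
    classifier_weight.mul_(scale)
```

In training, this would be called on the classifier's weight right after each `optimizer.step()`, while weight decay is applied through the optimizer as usual.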
3 code implementations • 7 Apr 2021 • Yi-Ting Chen, Jinghao Shi, Zelin Ye, Christoph Mertz, Deva Ramanan, Shu Kong
Object detection with multimodal inputs can improve many safety-critical systems such as autonomous vehicles (AVs).
Ranked #2 on Object Detection on InOutDoor
1 code implementation • ICCV 2021 • Shu Kong, Deva Ramanan
However, the former generalizes poorly to diverse open test data due to overfitting to the training outliers, which are unlikely to exhaustively span the open world.
no code implementations • 1 Jan 2021 • Shu Kong, Deva Ramanan
Machine-learned safety-critical systems need to be self-aware and reliably know their unknowns in the open world.
1 code implementation • CVPR 2021 • Yunhan Zhao, Shu Kong, Charless Fowlkes
We show that jointly applying the two methods improves depth prediction on images captured under uncommon and even never-before-seen camera poses.
no code implementations • 21 Jun 2020 • Zhiyuan Fang, Shu Kong, Zhe Wang, Charless Fowlkes, Yezhou Yang
The referring attention is a mechanism we design to act as a scoring function for temporally grounding the given queries over frames.
1 code implementation • 11 May 2020 • Linfeng Wang, Shu Kong, Zachary Pincus, Charless Fowlkes
The nematode Caenorhabditis elegans (C. elegans) serves as an important model organism in a wide variety of biological studies.
no code implementations • CVPR 2020 • Yunhan Zhao, Shu Kong, Daeyun Shin, Charless Fowlkes
In this setting, we find that existing domain translation approaches are difficult to train and offer little advantage over simple baselines that use a mix of real and synthetic data.
1 code implementation • CVPR 2019 • Zhiyuan Fang, Shu Kong, Charless Fowlkes, Yezhou Yang
Computer Vision applications often require a textual grounding module with precision, interpretability, and resilience to counterfactual inputs/queries.
2 code implementations • 2 Apr 2019 • Shu Kong, Charless Fowlkes
We introduce multigrid Predictive Filter Flow (mgPFF), a framework for unsupervised learning on videos.
2 code implementations • 28 Nov 2018 • Shu Kong, Charless Fowlkes
We propose a simple, interpretable framework for solving a wide range of image reconstruction problems such as denoising and deconvolution.
Ranked #32 on
Image Super-Resolution
on Set14 - 4x upscaling
1 code implementation • 3 May 2018 • Shu Kong, Charless Fowlkes
To achieve parsimonious inference in per-pixel labeling tasks with a limited computational budget, we propose a Pixel-wise Attentional Gating unit (PAG) that learns to selectively process a subset of spatial locations at each layer of a deep convolutional network (see the sketch below).
Ranked #7 on Semantic Segmentation on KITTI Semantic Segmentation
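A loose sketch of the per-pixel gating idea using a Gumbel-softmax relaxation; the details (and the actual compute savings) differ from the paper's PAG unit:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelGate(nn.Module):
    """Per-pixel gate sketch: a 1x1 conv predicts keep/skip logits per location,
    sampled with Gumbel-softmax so the hard mask stays differentiable."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 2, kernel_size=1)   # logits: [skip, keep]

    def forward(self, x, expensive_branch):
        logits = self.score(x)                                # (B, 2, H, W)
        mask = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=1)[:, 1:2]
        # A real implementation would skip computation at masked-off pixels;
        # here we compute the branch everywhere and mask its output.
        return mask * expensive_branch(x) + (1 - mask) * x
```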
no code implementations • 2 May 2018 • Feng Zhou, Shu Kong, Charless Fowlkes, Tao Chen, Baiying Lei
Specifically, we first map facial expressions to dimensional measures, transforming facial expression analysis from a classification problem into a regression one.
no code implementations • 1 May 2018 • Zhiyuan Fang, Shu Kong, Tianshu Yu, Yezhou Yang
Grounding textual phrases in visual content is a meaningful yet challenging problem with various potential applications such as image-text inference or text-driven multimedia interaction.
2 code implementations • CVPR 2018 • Shu Kong, Charless Fowlkes
We introduce a differentiable, end-to-end trainable framework for solving pixel-level grouping problems such as instance segmentation consisting of two novel components.
1 code implementation • CVPR 2018 • Shu Kong, Charless Fowlkes
We propose a depth-aware gating module that adaptively selects the pooling field size in a convolutional network architecture according to the object scale (inversely proportional to the depth), so that small details are preserved for distant objects while larger receptive fields are used for those nearby (see the sketch below).
Ranked #40 on Semantic Segmentation on SUN-RGBD (using extra training data)
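A rough sketch of depth-aware gating, assuming an aligned depth map and a few convolution branches with different dilation rates standing in for different pooling field sizes:

```python
import torch
import torch.nn as nn

class DepthAwareGating(nn.Module):
    """Blend branches with different receptive fields using per-pixel weights
    predicted from depth: distant objects favor small fields, nearby ones large."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in dilations
        )
        self.gate = nn.Conv2d(1, len(dilations), kernel_size=3, padding=1)

    def forward(self, feats, depth):          # depth: (B, 1, H, W), aligned to feats
        w = torch.softmax(self.gate(depth), dim=1)                    # (B, K, H, W)
        outs = torch.stack([b(feats) for b in self.branches], dim=1)  # (B, K, C, H, W)
        return (w.unsqueeze(2) * outs).sum(dim=1)                     # (B, C, H, W)
```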
no code implementations • CVPR 2017 • Shu Kong, Charless Fowlkes
To address the computational demands of high feature dimensionality, we propose to represent the covariance features as a matrix and apply a low-rank bilinear classifier.
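A minimal sketch of the low-rank bilinear scoring trick: with W_c = U_c V_c^T, the class score tr(W_c^T X) over the covariance X = F F^T / N can be computed without ever forming the D x D matrix (shapes and rank here are illustrative):

```python
import torch
import torch.nn as nn

class LowRankBilinearClassifier(nn.Module):
    """Score class c as tr(W_c^T X) with a rank-r factorization W_c = U_c V_c^T,
    avoiding explicit construction of the D x D covariance X = F F^T / N."""
    def __init__(self, dim, num_classes, rank=8):
        super().__init__()
        self.U = nn.Parameter(torch.randn(num_classes, dim, rank) * 0.01)
        self.V = nn.Parameter(torch.randn(num_classes, dim, rank) * 0.01)

    def forward(self, feats):                 # feats: (B, D, N) flattened locations
        N = feats.size(2)
        # project features onto the low-rank factors: (B, C, r, N)
        uf = torch.einsum('cdr,bdn->bcrn', self.U, feats)
        vf = torch.einsum('cdr,bdn->bcrn', self.V, feats)
        return (uf * vf).sum(dim=(2, 3)) / N  # (B, C) class scores
```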
2 code implementations • 6 Jun 2016 • Shu Kong, Xiaohui Shen, Zhe Lin, Radomir Mech, Charless Fowlkes
In this work, we propose to learn a deep convolutional neural network to rank photo aesthetics, in which the relative ranking of photo aesthetics is directly modeled in the loss function (see the sketch below).
Ranked #7 on Aesthetics Quality Assessment on AVA
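One common way to model relative ranking directly in the loss, sketched here with a margin ranking loss (the margin and training-pair construction are assumptions, not necessarily the paper's exact objective):

```python
import torch
import torch.nn as nn

# A shared CNN scores two photos; the loss pushes the higher-rated photo's
# score above the other's by at least the margin.
rank_loss = nn.MarginRankingLoss(margin=0.5)    # margin value is an assumption

def ranking_step(model, img_hi, img_lo):
    """img_hi should out-rank img_lo according to human aesthetic ratings."""
    s_hi, s_lo = model(img_hi), model(img_lo)
    target = torch.ones_like(s_hi)              # +1 means s_hi should be larger
    return rank_loss(s_hi, s_lo, target)
```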
no code implementations • 3 May 2016 • Shu Kong, Surangi Punyasena, Charless Fowlkes
We propose a robust approach for performing automatic species-level recognition of fossil pollen grains in microscopy images that exploits both global shape and local texture characteristics in a patch-based matching methodology.
1 code implementation • 2 Feb 2014 • Shu Kong, Zhuolin Jiang, Qiang Yang
However, measuring the pairwise distance between RFs for building the similarity graph is a nontrivial problem.
no code implementations • 22 Jan 2014 • Shu Kong, Zhuolin Jiang, Qiang Yang
We now know that mid-level features can greatly enhance the performance of image learning, but how to learn image features automatically, efficiently, and in an unsupervised manner remains an open question.