Universal Instance Perception as Object Discovery and Retrieval

12 Mar 2023 · Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan, Huchuan Lu

All instance perception tasks aim to find certain objects specified by queries such as category names, language expressions, or target annotations, yet this broad field has been split into multiple independent subtasks. In this work, we present UNINEXT, a next-generation universal instance perception model. UNINEXT reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm and can flexibly perceive different types of objects by simply changing the input prompts. This unified formulation brings two benefits: (1) enormous data from different tasks and label vocabularies can be exploited to jointly train general instance-level representations, which is especially beneficial for tasks lacking training data; (2) the unified model is parameter-efficient and avoids redundant computation when handling multiple tasks simultaneously. UNINEXT shows superior performance on 20 challenging benchmarks from 10 instance-level tasks, including classical image-level tasks (object detection and instance segmentation), vision-and-language tasks (referring expression comprehension and segmentation), and six video-level object tracking tasks. Code is available at https://github.com/MasterBin-IIAU/UNINEXT.
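The core idea described above — one discovery stage that proposes candidate instances, followed by a retrieval stage that ranks them against an embedding of the prompt, regardless of prompt type — can be illustrated with a toy sketch. Every name below (`Instance`, `embed_prompt`, `discover`, `retrieve`) is a hypothetical stand-in, not the actual UNINEXT API; a real model would use learned encoders and transformer object queries in place of these toy functions.

```python
# Toy sketch of the "object discovery and retrieval" paradigm.
# All names here are illustrative stand-ins, NOT the UNINEXT API.
from dataclasses import dataclass

@dataclass
class Instance:
    box: tuple        # (x1, y1, x2, y2) candidate location
    embedding: list   # instance feature vector

def embed_prompt(prompt):
    """Map any prompt (category name, language expression, or target
    annotation) into one shared embedding space. A real model uses a
    learned encoder; a deterministic toy hash stands in here."""
    s = str(prompt)
    return [sum(ord(c) for c in s[i::4]) % 97 / 96.0 for i in range(4)]

def cosine(a, b):
    """Similarity between an instance embedding and a prompt embedding."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / ((na * nb) or 1.0)

def discover(image):
    """Stage 1 (discovery): propose candidate instances, standing in
    for the decoder's object queries. Fake proposals for illustration."""
    return [Instance((0, 0, 10, 10), embed_prompt("cat")),
            Instance((5, 5, 20, 20), embed_prompt("dog"))]

def retrieve(instances, prompt, top_k=1):
    """Stage 2 (retrieval): rank discovered instances by similarity to
    the prompt embedding; the pipeline is identical whether the prompt
    is a class name, a sentence, or a reference annotation."""
    q = embed_prompt(prompt)
    ranked = sorted(instances, key=lambda inst: cosine(inst.embedding, q),
                    reverse=True)
    return ranked[:top_k]

# Detection ("cat"), grounding ("the cat on the left"), and tracking
# (a target annotation) all reduce to the same two calls:
print(retrieve(discover(image=None), prompt="cat")[0].box)  # → (0, 0, 10, 10)
```

Because both stages operate in a shared embedding space, switching tasks only changes the prompt, which is what allows one set of weights to serve all ten tasks.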


Results from the Paper


 Ranked #1 on Referring Expression Comprehension on RefCOCOg-test (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Multiple Object Tracking | BDD100K val | UNINEXT-H | mMOTA | 44.2 | #2 |
| | | | mIDF1 | 56.7 | #1 |
| Multi-Object Tracking and Segmentation | BDD100K val | UNINEXT-H | mMOTSA | 35.7 | #1 |
| Object Detection | COCO minival | UNINEXT-H | box AP | 60.6 | #15 |
| | | | AP50 | 77.5 | #3 |
| | | | AP75 | 66.7 | #2 |
| | | | APS | 45.1 | #2 |
| | | | APM | 64.8 | #2 |
| | | | APL | 75.3 | #2 |
| Instance Segmentation | COCO test-dev | UNINEXT-H | mask AP | 51.8 | #12 |
| | | | AP50 | 76.2 | #3 |
| | | | AP75 | 56.7 | #3 |
| | | | APS | 33.3 | #3 |
| | | | APM | 55.9 | #2 |
| | | | APL | 67.5 | #3 |
| Referring Expression Segmentation | DAVIS 2017 (val) | UNINEXT-H | J&F 1st frame | 72.5 | #1 |
| Visual Object Tracking | LaSOT | UNINEXT-H | AUC | 72.2 | #3 |
| | | | Normalized Precision | 80.8 | #4 |
| | | | Precision | 79.4 | #2 |
| Visual Object Tracking | LaSOT | UNINEXT-L | AUC | 72.4 | #2 |
| | | | Normalized Precision | 80.7 | #5 |
| | | | Precision | 78.9 | #3 |
| Visual Object Tracking | LaSOT-ext | UNINEXT-H | AUC | 56.2 | #1 |
| | | | Normalized Precision | 63.8 | #1 |
| | | | Precision | 63.8 | #1 |
| Video Instance Segmentation | OVIS validation | UNINEXT-H | mask AP | 49.0 | #1 |
| | | | AP50 | 72.5 | #1 |
| | | | AP75 | 52.2 | #1 |
| Referring Expression Comprehension | RefCOCO+ | UNINEXT-H | Val | 85.24 | #2 |
| | | | Test A | 89.63 | #3 |
| | | | Test B | 79.79 | #2 |
| Referring Expression Comprehension | RefCOCO | UNINEXT-H | Val | 92.64 | #1 |
| | | | Test A | 94.33 | #1 |
| | | | Test B | 91.46 | #1 |
| Referring Expression Comprehension | RefCOCOg-test | UNINEXT-H | Accuracy | 89.37 | #1 |
| Referring Expression Comprehension | RefCOCOg-val | UNINEXT-H | Accuracy | 88.73 | #1 |
| Referring Expression Segmentation | RefCOCO testA | UNINEXT-H | Overall IoU | 83.44 | #1 |
| Referring Expression Segmentation | RefCOCO+ testA | UNINEXT-H | Overall IoU | 76.42 | #1 |
| Referring Expression Segmentation | RefCOCO testB | UNINEXT-H | Overall IoU | 81.33 | #1 |
| Referring Expression Segmentation | RefCOCO+ testB | UNINEXT-H | Overall IoU | 66.22 | #1 |
| Referring Expression Segmentation | RefCOCO val | UNINEXT-H | Overall IoU | 82.19 | #1 |
| Referring Expression Segmentation | RefCOCO+ val | UNINEXT-H | Overall IoU | 72.47 | #1 |
| Referring Video Object Segmentation | Refer-YouTube-VOS | UNINEXT-H | J&F | 70.1 | #1 |
| | | | J | 67.6 | #1 |
| | | | F | 72.7 | #1 |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | UNINEXT-H | J&F | 70.1 | #1 |
| | | | J | 67.6 | #1 |
| | | | F | 72.7 | #1 |
| Visual Tracking | TNL2K | UNINEXT-H | Precision | 62.8 | #1 |
| | | | AUC | 59.3 | #1 |
| Visual Object Tracking | TrackingNet | UNINEXT-H | Precision | 86.4 | #1 |
| | | | Normalized Precision | 89.0 | #3 |
| | | | Accuracy | 85.4 | #2 |
| Video Instance Segmentation | YouTube-VIS validation | UNINEXT-H | mask AP | 66.9 | #1 |
| | | | AP50 | 87.5 | #1 |
| | | | AP75 | 75.1 | #1 |
