Rethinking pose estimation in crowds: overcoming the detection information-bottleneck and ambiguity
Frequent interactions between individuals are a fundamental challenge for pose estimation algorithms. Current pipelines either use an object detector together with a pose estimator (top-down approach), or localize all body parts first and then link them to predict the pose of individuals (bottom-up). Yet, when individuals closely interact, top-down methods are ill-defined due to overlapping individuals, and bottom-up methods often falsely infer connections to distant body parts. Thus, we propose a novel pipeline called bottom-up conditioned top-down pose estimation (BUCTD) that combines the strengths of bottom-up and top-down methods. Specifically, we propose to use a bottom-up model as the detector, which in addition to an estimated bounding box provides a pose proposal that is fed as a condition to an attention-based top-down model. We demonstrate the performance and efficiency of our approach on animal and human pose estimation benchmarks. On CrowdPose and OCHuman, we outperform previous state-of-the-art models by a significant margin. We achieve 78.5 AP on CrowdPose and 48.5 AP on OCHuman, an improvement of 8.6% and 7.8% over the prior art, respectively. Furthermore, we show that our method strongly improves the performance on multi-animal benchmarks involving fish and monkeys. The code is available at https://github.com/amathislab/BUCTD
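The sketch below illustrates the two-stage idea described in the abstract, a bottom-up model acting as the "detector" whose per-individual pose proposal both yields a bounding box and serves as the condition for an attention-based top-down refinement. All function and parameter names (`buctd_inference`, `bottom_up_model`, `conditional_top_down_model`, `margin`) are hypothetical placeholders, not the authors' actual API; see the linked repository for the real implementation.

```python
# Minimal BUCTD-style inference sketch (hypothetical interfaces, not the authors' code).
import torch

def buctd_inference(image, bottom_up_model, conditional_top_down_model, margin=10):
    """image: (3, H, W) tensor; the two models are assumed callables.

    bottom_up_model(image) -> list of (num_keypoints, 3) tensors with (x, y, confidence).
    conditional_top_down_model(crop, condition) -> refined (num_keypoints, 3) pose.
    """
    # Stage 1: bottom-up pose proposals, one per individual.
    proposals = bottom_up_model(image)

    refined = []
    _, H, W = image.shape
    for proposal in proposals:
        visible = proposal[:, 2] > 0
        # Bounding box derived from the pose proposal (no separate object detector).
        x0 = int(max(proposal[visible, 0].min().item() - margin, 0))
        y0 = int(max(proposal[visible, 1].min().item() - margin, 0))
        x1 = int(min(proposal[visible, 0].max().item() + margin, W))
        y1 = int(min(proposal[visible, 1].max().item() + margin, H))
        crop = image[:, y0:y1, x0:x1]

        # Stage 2: the proposal, shifted into crop coordinates, is the "condition";
        # the top-down model attends to the crop and the condition jointly to
        # resolve which keypoints belong to this individual.
        condition = proposal.clone()
        condition[:, 0] -= x0
        condition[:, 1] -= y0
        refined.append(conditional_top_down_model(crop, condition))
    return refined
```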
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Multi-Person Pose Estimation | CrowdPose | BUCTD-W48 (w/cond. input from PETR, and generative sampling) | mAP @0.5:0.95 | 78.5 | # 2
Multi-Person Pose Estimation | CrowdPose | BUCTD-W48 (w/cond. input from PETR, and generative sampling) | AP Easy | 83.9 | # 2
Multi-Person Pose Estimation | CrowdPose | BUCTD-W48 (w/cond. input from PETR, and generative sampling) | AP Medium | 79.0 | # 2
Multi-Person Pose Estimation | CrowdPose | BUCTD-W48 (w/cond. input from PETR, and generative sampling) | AP Hard | 72.3 | # 2
Pose Estimation | CrowdPose | BUCTD-W48 (w/cond. input from PETR, and generative sampling) | AP | 78.5 | # 1
Pose Estimation | CrowdPose | BUCTD-W48 (w/cond. input from PETR, and generative sampling) | AP Hard | 72.3 | # 1
Pose Estimation | CrowdPose | BUCTD-W48 (w/cond. input from PETR, and generative sampling) | AP Easy | 83.9 | # 1
Pose Estimation | CrowdPose | BUCTD-W48 (w/cond. input from PETR, and generative sampling) | AP Medium | 79.0 | # 2
Pose Estimation | CrowdPose | BUCTD-W48 | AP | 72.9 | # 6
Pose Estimation | CrowdPose | BUCTD-W48 (w/cond. input from PETR) | AP | 76.7 | # 3
Animal Pose Estimation | Fish-100 | BUCTD-preNet-W48 (DLCRNet) | mAP | 88.7 | # 2
Animal Pose Estimation | Fish-100 | BUCTD-preNet-W48 (CID-W32) | mAP | 88.0 | # 3
Animal Pose Estimation | Fish-100 | HRNet-W48 + Faster R-CNN | mAP | 89.1 | # 1
Animal Pose Estimation | Marmoset-8K | BUCTD-preNet-W48 (CID-W32) | mAP | 93.3 | # 1
Animal Pose Estimation | Marmoset-8K | BUCTD-CoAM-W48 (DLCRNet) | mAP | 91.6 | # 3
Animal Pose Estimation | Marmoset-8K | CID-W32 | mAP | 92.5 | # 2
Pose Estimation | MS COCO | BUCTD (PETR, with generative sampling) | APM | 74.2 | # 2
Pose Estimation | MS COCO | BUCTD (PETR, with generative sampling) | APL | 83.7 | # 3
Pose Estimation | MS COCO | BUCTD (PETR, with generative sampling) | AP | 77.8 | # 4
Pose Estimation | OCHuman | BUCTD (CID-W32) | Test AP | 47.2 | # 5
Pose Estimation | OCHuman | BUCTD (CID-W32) | Validation AP | 47.7 | # 5
Animal Pose Estimation | TriMouse-161 | BUCTD-CoAM-W48 (DLCRNet) | mAP | 99.1 | # 1
Animal Pose Estimation | TriMouse-161 | DLCRNet | mAP | 95.8 | # 3
Animal Pose Estimation | TriMouse-161 | CID-W32 | mAP | 86.8 | # 6