The semantic grouping is performed by assigning pixels to a set of learnable prototypes, which adapt to each sample through attentive pooling over the feature map to form new slots.
Experiments show that our method can significantly boost the performance of query-based detectors in crowded scenes.
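The grouping step can be illustrated with a minimal NumPy sketch. It assumes dot-product similarity with a temperature, a single round of attentive pooling, and hard argmax assignment of pixels to slots. The function names `softmax` and `group_pixels` are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def group_pixels(features, prototypes, temperature=1.0):
    """Hypothetical sketch of prototype-based grouping.

    features:   (N, D) pixel features of one sample
    prototypes: (K, D) learnable prototypes shared across samples
    """
    # Similarity between every prototype and every pixel: (K, N)
    logits = prototypes @ features.T / temperature
    # Attentive pooling: each prototype attends over all pixels (softmax across pixels)
    attn = softmax(logits, axis=1)
    # Sample-adaptive slots: attention-weighted averages of pixel features, (K, D)
    slots = attn @ features
    # Hard grouping: assign each pixel to its most similar slot, (N,)
    assignment = np.argmax(slots @ features.T, axis=0)
    return slots, assignment

# Toy example: two clusters of 2-D pixel features and two prototypes
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
slots, assignment = group_pixels(features, prototypes, temperature=0.1)
print(assignment.tolist())  # [0, 0, 1, 1]
```

Because the slots are pooled from each sample's own features, they shift toward that sample's content, whereas the prototypes themselves stay fixed after training.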
We propose a simple yet effective proposal-based object detector, aiming at detecting highly overlapped instances in crowded scenes.
Segmenting primary objects in a video is an important yet challenging problem in computer vision, as videos exhibit varying levels of foreground/background ambiguity.
By applying the CCNN to each video frame, spatial foregroundness and backgroundness maps can be initialized and then propagated across frames to segment primary video objects and suppress distractors.
Toward this end, this paper proposes two-stream fixation-semantic CNNs, whose architecture is inspired by the fact that salient objects in complex images can be unambiguously annotated by selecting the pre-segmented semantic objects that receive the highest fixation density in eye-tracking experiments.
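The annotation principle above (choosing the pre-segmented semantic object with the highest fixation density) can be sketched as follows. This assumes that "fixation density" means the mean value of an eye-tracking density map over a segment's pixels and that label 0 marks unlabeled pixels. The function `select_salient_segments` and its `top_k` parameter are hypothetical illustrations, not part of the described method.

```python
import numpy as np

def select_salient_segments(segments, fixation_map, top_k=1):
    """Hypothetical sketch: pick the semantic segments that attract the most fixations.

    segments:     (H, W) integer labels of pre-segmented semantic objects (0 = unlabeled)
    fixation_map: (H, W) non-negative fixation density from eye tracking
    """
    labels = np.unique(segments)
    labels = labels[labels != 0]
    # Mean fixation density inside each segment (normalizes for segment area)
    density = np.array([fixation_map[segments == lab].mean() for lab in labels])
    # Keep the top-k segments, highest density first
    order = np.argsort(density)[::-1][:top_k]
    chosen = labels[order]
    # Binary salient-object mask formed by the chosen segments
    mask = np.isin(segments, chosen)
    return mask, chosen

# Toy example: three segments; segment 2 receives most of the fixations
segments = np.array([[1, 1, 2, 2],
                     [1, 1, 2, 2],
                     [0, 0, 3, 3]])
fixation_map = np.array([[0.1, 0.1, 0.9, 0.8],
                         [0.0, 0.2, 0.7, 0.9],
                         [0.5, 0.5, 0.1, 0.0]])
mask, chosen = select_salient_segments(segments, fixation_map)
print(chosen.tolist())  # [2]
```

Averaging over each segment's area, rather than summing, keeps large but rarely fixated regions from winning simply because they are large.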
Determining what is and what is not a salient object can help in developing better features and models for salient object detection (SOD).