Generally, given pseudo ground truths generated by a well-trained WSOD network, we propose a two-module iterative training algorithm that refines the pseudo labels and progressively trains a better object detector.
Based on this understanding, in this paper, Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap.
This paper presents a new method that solves keypoint detection and instance association using a Transformer.
Ranked #10 on Multi-Person Pose Estimation on COCO
After DETR was proposed, this novel transformer-based detection paradigm, which performs several cross-attentions between object queries and feature maps for predictions, subsequently gave rise to a series of transformer-based detection heads.
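As a rough illustration of the cross-attention step mentioned above, the following minimal sketch lets each object query attend over a flattened feature map using scaled dot-product attention. It is a simplification under stated assumptions: the learned query/key/value projections, multi-head structure, and positional encodings used in DETR-style heads are omitted, and the function name `cross_attention` is hypothetical.

```python
import math


def cross_attention(queries, features):
    """Scaled dot-product cross-attention without learned projections.

    Assumption: the feature vectors serve as both keys and values.
    queries:  list of query vectors, each of length d
    features: list of feature vectors (flattened feature map), each of length d
    """
    d = len(features[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every feature location.
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d)
                  for f in features]
        # Numerically stable softmax over feature locations.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the feature vectors.
        outputs.append([sum(w * f[k] for w, f in zip(weights, features))
                        for k in range(d)])
    return outputs


features = [[1.0, 0.0], [0.0, 1.0]]
queries = [[0.0, 0.0], [4.0, 0.0]]
attended = cross_attention(queries, features)
```

A zero query weights all locations equally and returns the mean feature, while a query aligned with the first feature pulls the output toward that location.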
It can generate and fuse multi-scale features of the same spatial sizes by setting different dilation rates for different channels.
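The idea of assigning a different dilation rate to each channel while keeping the spatial size fixed can be sketched in one dimension. This is only an illustrative sketch, not the authors' layer: the shared kernel, the zero padding, the fusion by summation, and the names `dilated_conv1d` and `multi_dilation_fuse` are all assumptions made for the example.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-size 1D dilated convolution (cross-correlation) with zero padding."""
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Kernel taps are spaced `dilation` samples apart.
            idx = i + (j - center) * dilation
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def multi_dilation_fuse(channels, kernel, dilations):
    """Apply one dilation rate per channel, then fuse by element-wise sum."""
    if len(channels) != len(dilations):
        raise ValueError("need one dilation rate per channel")
    per_channel = [dilated_conv1d(c, kernel, d)
                   for c, d in zip(channels, dilations)]
    # All outputs share the input length, so they can be summed directly.
    fused = [sum(vals) for vals in zip(*per_channel)]
    return per_channel, fused


impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
channels = [impulse, impulse, impulse]
per_channel, fused = multi_dilation_fuse(channels, [1.0, 1.0, 1.0], [1, 2, 3])
```

With an impulse input, each channel spreads the response over a wider span as its dilation rate grows, yet every output keeps the input length, which is what makes the fusion step a plain element-wise operation.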
Graph Neural Networks (GNNs) have demonstrated their effectiveness in dealing with non-Euclidean structural data.
Most existing CNN-based methods do well in visual representation but lack the ability to explicitly learn the constraint relationships between keypoints.
V2F-Net consists of two sub-networks: Visible region Detection Network (VDN) and Full body Estimation Network (FEN).
Ranked #1 on Object Detection on CityPersons
In recent years, knowledge distillation has been proved to be an effective solution for model compression.
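For readers unfamiliar with the technique, a minimal sketch of the classic soft-target distillation loss is given here: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. This is the generic formulation, not necessarily the variant this paper uses; the function names and the default temperature are assumptions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum for s in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]
matched = distillation_loss([4.0, 1.0, 0.5], teacher)
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge; in practice it is usually combined with the ordinary supervised loss on ground-truth labels.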
However, for bottom-up methods, which need to handle a large variance of human scales and labeling ambiguities, the current practice seems unreasonable.
Instead, we focus on exploiting multi-scale information from layers with different receptive-field sizes and then making full use of this information by improving the fusion method.
To combine the distribution-level relations and instance-level relations for all examples, we construct a dual complete graph network which consists of a point graph and a distribution graph with each node standing for an example.
Ranked #2 on Few-Shot Learning on Mini-ImageNet - 1-Shot Learning
When aligning two groups of local features from two images, we view it as a graph matching problem and propose a cross-graph embedded-alignment (CGEA) layer to jointly learn and embed topology information into local features and directly predict the similarity score.
To tackle this problem, we propose an efficient attention mechanism, Pose Refine Machine (PRM), to make a trade-off between local and global representations in output features and further refine the keypoint locations.
Ranked #1 on Keypoint Detection on COCO
In this paper, we propose a method, called GridFace, to reduce facial geometric variations and improve the recognition performance.